
“Serial” versus “Parallel”: A Comparison of Spatio-temporal Clustering Approaches
Yongli Zhang, Sujing Wang, Amar Mani Aryal, and Christoph F. Eick
Data Analysis and Intelligent Systems Lab, University of Houston, USA
23rd International Symposium on Methodologies for Intelligent Systems (ISMIS 2017), June 26, 2017, Warsaw, Poland

Talk Outline
1. Introduction
2. ST-DPOLY
3. ST-SNN
4. Experimental Results
5. Comparison
6. Future Work

1. Introduction
With the development of remote sensors and sensor networks, different types of spatio-temporal datasets are becoming increasingly available. Our work centers on developing spatio-temporal clustering and hotspot discovery algorithms that are capable of identifying dense regions in the location/time space.

Past Work
Kulldorff et al. [1] introduce a spatial scan statistic for the detection of spatio-temporal cylinders in which point objects occur consistently for a significant period of time.
Iyengar et al. [2] extend the basic scan statistic with a flexible square pyramid shape to detect clusters with restrictive shapes; the proposed framework can model growth and shifts in location over time.
Wang et al. [3] propose the spatio-temporal clustering algorithm ST-GRID, which maps the spatial and temporal dimensions into multidimensional cells and then extracts and merges spatio-temporal dense regions to obtain the final clusters.
Birant et al. [4] propose ST-DBSCAN, an extension of DBSCAN for spatio-temporal clustering that introduces a second parameter, the temporal neighborhood radius, in addition to the spatial neighborhood radius.

Motivation
However, most spatio-temporal clustering algorithms are not suitable for large data streams, because they: pass over the data several times; cannot deal with very large datasets; and use time and location in a parallel fashion.
For example, ST-DBSCAN [4] treats time and location in parallel, assumes that the dataset fits into main memory instead of arriving in batches as a stream, and scans through the dataset multiple times during the clustering process.
Moreover, comparisons between the serial approach and the parallel approach are rarely investigated in the literature.


2. ST-DPOLY
ST-DPOLY is a serial approach. It subdivides the incoming data into batches and first generates spatial clusters for each batch. Next, spatio-temporal clusters are formed by identifying continuing relationships between spatial clusters in consecutive batches.
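The batching step can be illustrated with a minimal sketch. It assumes each record is a `(timestamp, longitude, latitude)` tuple and that batches are fixed-length time windows; the function name `split_into_batches` and the sample records are hypothetical and not taken from the slides.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def split_into_batches(records, start, batch_length):
    """Group (timestamp, lon, lat) records into fixed-length time batches.

    Returns a dict mapping batch index -> list of (lon, lat) points.
    Fixed-length windows are an assumption made for this sketch.
    """
    batches = defaultdict(list)
    for ts, lon, lat in records:
        index = (ts - start) // batch_length  # timedelta // timedelta gives an int
        batches[index].append((lon, lat))
    return dict(batches)

# Hypothetical pickup records, for illustration only.
start = datetime(2014, 1, 8, 6, 0)
records = [
    (datetime(2014, 1, 8, 6, 5), -73.98, 40.75),
    (datetime(2014, 1, 8, 6, 25), -73.97, 40.76),
    (datetime(2014, 1, 8, 6, 30), -73.99, 40.74),
    (datetime(2014, 1, 8, 6, 59), -73.98, 40.75),
]
batches = split_into_batches(records, start, timedelta(minutes=20))
```

Each batch would then be clustered spatially on its own before clusters in consecutive batches are linked.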

ST-DPOLY cont.
Inputs: a point cloud stream (e.g., taxi pickup location streams described by longitude, latitude, and pickup time) and the data collection area (e.g., the New York metropolitan area).
Outputs: spatio-temporal clusters, which are graphs of related spatial clusters.

The Three Phases of ST-DPOLY
1. Obtain a spatial density function for the spatial point cloud collected in each batch.
2. Identify spatial clusters for each batch as polygons that are created from density contour lines of the spatial density function.
3. Identify relationships between spatial clusters in consecutive batches, and construct spatio-temporal clusters as continuing spatial clusters in consecutive batches.

ST-DPOLY Phase 1
In our approach, we use non-parametric kernel density estimation (KDE) [8] to obtain a 2-dimensional spatial density function f. The kernel density estimator is defined in Eq. 1, which is not preserved in this transcript.
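As an illustration of Phase 1, the sketch below implements the textbook 2-dimensional kernel density estimator f(x) = 1/(n h^2) * sum_i K((x - x_i)/h) with an isotropic Gaussian kernel K(u) = exp(-|u|^2 / 2) / (2*pi). The kernel choice, the single bandwidth `h`, and the name `gaussian_kde_2d` are assumptions; the slides do not preserve the exact form of Eq. 1.

```python
import math

def gaussian_kde_2d(points, bandwidth):
    """Return a density function f(x, y) estimated from 2-D points.

    Uses an isotropic Gaussian kernel with one bandwidth for both
    dimensions (an assumption, not confirmed by the slides).
    """
    n = len(points)
    # 1 / (n * h^2) from the estimator times 1 / (2 * pi) from the kernel.
    norm = 1.0 / (n * 2.0 * math.pi * bandwidth ** 2)

    def f(x, y):
        total = 0.0
        for px, py in points:
            squared_distance = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-squared_distance / (2.0 * bandwidth ** 2))
        return norm * total

    return f

# Hypothetical points: density is higher near the points than far away.
density = gaussian_kde_2d([(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)], bandwidth=0.5)
near, far = density(0.0, 0.0), density(5.0, 5.0)
```

Evaluating every point against every query location is quadratic; a production version would likely truncate the kernel or use a spatial index.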

ST-DPOLY Phase 2
1. Grid the data collection area.
2. Calculate a probability density for all grid intersection points using the spatial density function (given in Eq. 1), and obtain a density matrix. Create a table T to store the locations of all grid intersection points and the corresponding density matrix.
3. Pass T, along with a pair of density thresholds, to CONREC, which returns two sets of contour lines.
4. Close open contour lines.
5. Classify the obtained contour lines into holes and spatial clusters.
6. Construct spatial cluster polygons using the information obtained in step 5.
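Steps 1 and 2 above can be sketched as follows. The code evaluates a density function on an m x m grid and then marks the grid points at or above one threshold. That thresholding is only a simplified stand-in for the contour extraction: it does not reproduce CONREC, contour closing, or hole classification. The names `density_matrix` and `cells_above` and the toy density are hypothetical.

```python
def density_matrix(f, x_min, x_max, y_min, y_max, m):
    """Evaluate density function f at the m x m grid intersection points."""
    xs = [x_min + i * (x_max - x_min) / (m - 1) for i in range(m)]
    ys = [y_min + j * (y_max - y_min) / (m - 1) for j in range(m)]
    # Row r corresponds to ys[r], column c to xs[c].
    matrix = [[f(x, y) for x in xs] for y in ys]
    return xs, ys, matrix

def cells_above(matrix, threshold):
    """Return the (row, column) indices whose density is at least threshold."""
    return {
        (r, c)
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
        if value >= threshold
    }

# Toy density with a single bump at the origin, for illustration only.
def toy_density(x, y):
    return max(0.0, 1.0 - (x * x + y * y))

xs, ys, matrix = density_matrix(toy_density, -1.0, 1.0, -1.0, 1.0, m=5)
dense_cells = cells_above(matrix, threshold=0.6)
```

A contouring routine would instead trace the boundary of such dense regions and return it as polygon vertices.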

ST-DPOLY Phase 3
We define an overlap matrix for two cluster sets X and X' obtained for two consecutive batches i and i+1, with X having a list of M clusters and X' having a list of N clusters. Each entry of this M × N matrix measures the overlap between one cluster in X and one cluster in X' (the defining equation is not preserved in this transcript).
If two spatial clusters in two consecutive batches have significant overlap, we conclude that the spatial cluster does not change significantly between the two batches, and create a 'continuing' relationship between the two clusters.
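A minimal sketch of the overlap matrix follows, under two loudly stated assumptions: each spatial cluster is represented as a set of grid cells rather than a polygon, and overlap is measured as intersection size divided by union size. The slides do not preserve the actual overlap formula or the meaning of "significant", so the `min_overlap` threshold and both function names are hypothetical.

```python
def overlap_matrix(clusters_prev, clusters_next):
    """Build an M x N matrix of overlaps between two lists of clusters.

    Each cluster is a set of grid cells. Intersection-over-union is an
    assumed overlap measure, not the one defined on the slides.
    """
    return [
        [len(a & b) / len(a | b) if (a or b) else 0.0 for b in clusters_next]
        for a in clusters_prev
    ]

def continuing_pairs(matrix, min_overlap):
    """Return (i, j) pairs whose overlap reaches the assumed threshold."""
    return [
        (i, j)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value >= min_overlap
    ]

# Hypothetical clusters from batch i and batch i+1.
batch_i = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
batch_next = [{(0, 1), (1, 1), (0, 2), (1, 2)}, {(5, 5), (5, 6), (6, 5)}, {(9, 9)}]

overlaps = overlap_matrix(batch_i, batch_next)
links = continuing_pairs(overlaps, min_overlap=0.5)
```

Chaining such links across all consecutive batches yields the graph of related spatial clusters that forms a spatio-temporal cluster.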


3. ST-SNN
ST-SNN is a “parallel” approach based on the well-established generic clustering algorithm Shared Nearest Neighbor (SNN). It relies on a distance function that combines spatial and temporal distances. It processes all the input data together and uses the distance function to compute shared neighbors for pairs of spatio-temporal objects.
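The two ingredients described above can be sketched as follows. The combined distance is assumed to be the spatial Euclidean distance plus a weight `w` times the time gap; the slides state only that spatial and temporal distances are combined, so this weighted sum is an assumption. The shared-neighbor count follows the usual SNN convention that two objects are similar only if each appears in the other's k-nearest-neighbor list. All names and sample points are hypothetical.

```python
import math

def st_distance(p, q, w):
    """Combined distance between (x, y, t) objects: spatial + w * temporal."""
    spatial = math.hypot(p[0] - q[0], p[1] - q[1])
    temporal = abs(p[2] - q[2])
    return spatial + w * temporal

def k_nearest(points, k, w):
    """Return, for each point index, the set of its k nearest neighbor indices."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: st_distance(p, points[j], w),
        )
        neighbors.append(set(others[:k]))
    return neighbors

def snn_similarity(neighbors, i, j):
    """Count shared neighbors; zero unless i and j list each other."""
    if i not in neighbors[j] or j not in neighbors[i]:
        return 0
    return len(neighbors[i] & neighbors[j])

# Two hypothetical groups, separated in both space and time.
points = [
    (0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1),
    (10, 10, 50), (11, 10, 50), (10, 11, 51),
]
neighbors = k_nearest(points, k=3, w=0.1)
```

The full SNN algorithm would then derive point densities from these similarities and grow clusters from core points; that part is omitted here.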


4. Experimental Results
Dataset: NYC taxi trip dataset [6]. It contains data for over 1.1 billion taxi trips from January 2009 through June 2016. Each trip record contains the precise coordinates where the trip started and ended, the timestamps when the trip started and ended, the trip distance, and the fare.

Taxi-Cab Results: ST-DPOLY
For the experiment, we use yellow taxi pick-up locations collected in 20-minute intervals as batches. We analyzed taxi pickups from 6 to 7 AM on January 8, 2014. We obtained 3 clusters for the 6:00-6:20 batch, 2 clusters for the 6:20-6:40 batch, and 4 clusters for the 6:40-7:00 batch. According to the results, as far as pick-ups are concerned, the area east of Midtown is a hotspot crowded with people looking for taxis early in the morning, as is the region centered around Grand Central Terminal.

Taxi-Cab Results: ST-SNN
We use Euclidean distance to compute the spatial distance. 16 clusters were obtained. The figure visualizes clusters 2, 13 and 14, which are centered around several bus terminals and train stations, a pattern similar to the clustering results generated by ST-DPOLY. Clusters 2 and 13 are similar in the spatial domain, but the time slots corresponding to these two clusters are different.


5. Comparison
5.1 Time/Space Complexity
Note: the grid size we use is m × m, n is the total number of points, and e is the average number of edges a spatial cluster has.
Since ST-DPOLY is a serial approach, its overall time complexity is $O(m^2 \times n)$; in cases where the number of data points is much larger than the number of grid cells ($n \gg m^2$), ST-DPOLY's complexity becomes $O(n)$. ST-DPOLY is superior to ST-SNN in terms of both time and space complexity.

5. Comparison
5.2 Temporal Flexibility
ST-SNN is more flexible temporally, as its clusters have more variation with respect to temporal mean and standard deviation. In ST-DPOLY, although the batch time interval can be selected based on application needs, it is fixed throughout the clustering process once selected. Therefore, ST-SNN has the potential to detect “more optimal” time intervals. However, the clustering result of ST-DPOLY is more straightforward, and for clustering data streams ST-DPOLY, as a serial approach, is more appropriate.

5. Comparison
5.3 Quality of Clusters
To compare the quality of the clustering results, we measure the variation of the clusters obtained. The clusters generated by ST-SNN have smaller standard deviations and ranges of time, longitude, and latitude than the clusters identified by ST-DPOLY.
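The variation measures named above can be computed per cluster as in this minimal sketch. It assumes a cluster is a list of `(time, longitude, latitude)` tuples and uses the population standard deviation; the slides do not say whether the population or sample form was used. The function names and the sample cluster are hypothetical.

```python
import statistics

def variation(values):
    """Return the standard deviation and range of a sequence of numbers."""
    return {
        "std": statistics.pstdev(values),  # population form (an assumption)
        "range": max(values) - min(values),
    }

def cluster_variation(cluster):
    """Summarize a cluster of (time, lon, lat) tuples per dimension."""
    times, lons, lats = zip(*cluster)
    return {
        "time": variation(times),
        "longitude": variation(lons),
        "latitude": variation(lats),
    }

# Hypothetical two-point cluster; time is in minutes after 6:00 AM.
summary = cluster_variation([(0, 1.0, 2.0), (10, 1.0, 4.0)])
```

Smaller values in each dimension indicate a tighter cluster in time and space.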

6. Future Work
Enable ST-DPOLY to support dynamic or adaptive batch sizes.
Investigate semi-automatic and automatic parameter selection tools to facilitate the use of ST-DPOLY.
Conduct a more thorough experimental evaluation.

Future Work cont.
Extend our approach to support multiple density thresholds.

References
[1] Martin Kulldorff. Spatial scan statistics: models, calculations, and applications. In Scan Statistics and Applications, pages 303-322. Springer, 1999.
[2] Vijay S. Iyengar. On detecting space-time clusters. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 587-592. ACM, 2004.
[3] Min Wang, Aiping Wang, and Anbo Li. Mining spatial-temporal clusters from geo-databases. In International Conference on Advanced Data Mining and Applications, pages 263-270. Springer, 2006.
[4] Derya Birant and Alp Kut. ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data & Knowledge Engineering, 60(1):208-221, 2007.
[5] Paul D. Bourke. A contouring subroutine. Byte, 12(6):143-150, 1987.
[6] http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (accessed August 23, 2016).
[7] S. Wang, T. Cai, and C. F. Eick. New spatiotemporal clustering algorithms and their applications to ozone pollution. In 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW), pages 1061-1068. IEEE, 2013.

Any Questions?