Density-based Place Clustering in Geo-Social Networks Jieming Shi, Nikos Mamoulis, Dingming Wu, David W. Cheung Department of Computer Science, The University of Hong Kong
Clustering
Spatial clustering – grouping of spatial objects (geographic places in our case) into clusters.
Useful for marketing and urban planning.
Density-based clustering divides a large collection of points into densely populated regions.
DBSCAN algorithm
DBSCAN is one of the most common data clustering algorithms – proposed in 1996.
For each place p, it finds all the places within radius ε of p – the ε-neighborhood of p.
If the number of places in the ε-neighborhood is no less than MinPts, p is called a core point: it will form a cluster or become part of an existing cluster.
Dense ε-neighborhoods are put into the same cluster if each contains the core point of the other.
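The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: it assumes places are planar (x, y) tuples, uses a brute-force neighborhood scan instead of a spatial index, and counts p itself as a member of its own ε-neighborhood. The names `region_query`, `dbscan`, and `NOISE` are chosen here for illustration.

```python
from math import hypot

NOISE = -1  # label for places that belong to no cluster (assumed convention)

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if hypot(px - qx, py - qy) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it.
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core: keep expanding
        cluster_id += 1
    return labels
```

With two tight groups of four places and one far-away place, `dbscan(points, eps=1.5, min_pts=4)` would be expected to label the groups as two clusters and the isolated place as noise.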
Example [figure: ε-neighborhoods expanded in steps 1, 2, 3, … until finish; MinPts = 4]
DBSCAN result example
Use of geo-social network data
Current spatial clustering models disregard information about the people who are related to the clustered places.
A social network with geographic check-ins includes: users, friendship connections, and check-ins.
Motivation
Urban planning: land managers are interested in identifying regions with uniform demographic statistics (for example, areas that elderly people prefer to visit, or areas whose visitors share special transportation or living needs).
Data cleaning: nearby geo-social network locations collected from user check-ins could belong to the same physical place.
Marketing: if two or more places belong to the same geo-social cluster, a user who likes one place will probably be interested in visiting the others.
[Figure labels: users, places, friendship connections, check-ins]
Example 1 Example 2
Density-based Clustering Places in Geo-Social Networks (DCPGS)
Input
DCPGS - Geo-social ε-neighborhood definition
DCPGS algorithm idea
Distance functions
Social distance
Alternative ways to compute social distance – (1) Jaccard
Alternative ways to compute social distance – (2) SimRank
Alternative ways to compute social distance – (3) Katz
Alternative ways to compute social distance – (4) Commute Time
Algorithms DCPGS-R and DCPGS-G
DCPGS-R: R-tree based
The algorithm uses an R-tree to speed up the search for the geo-social ε-neighborhood of a given place.
For efficiency, the social network is stored in a hash table, with each pair of friends as an entry.
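One way to picture the friend-pair hash table is the small Python sketch below. It is an assumption-laden illustration rather than the presenters' data structure: user ids are taken to be comparable values such as strings, friendship is treated as symmetric, and the helper names `build_friend_index` and `are_friends` are hypothetical.

```python
def _key(u, v):
    """Order the pair so that (u, v) and (v, u) map to the same entry."""
    return (u, v) if u <= v else (v, u)

def build_friend_index(friend_pairs):
    """Store every friendship as one hashed entry for O(1) expected lookup."""
    return {_key(u, v) for u, v in friend_pairs}

def are_friends(index, u, v):
    """True if the pair (u, v) was recorded as friends, in either order."""
    return _key(u, v) in index
```

Storing each pair once under a normalized key keeps the table half the size of a version that stores both directions, at the cost of one comparison per lookup.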
[Pseudocode annotations: spatial query – uses R-tree; the distance has already been computed; compute social and geo-social distance]
DCPGS-G: Grid-based
An individual R-tree range query finds all the places within radius maxD of a given geographic place in O(log n + ), which is O(log n) in most cases.
But with millions of places, millions of such queries must be performed.
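The general idea of replacing per-place index queries with a grid can be sketched as follows. This is a generic uniform-grid range search, not necessarily the partitioning DCPGS-G uses: it assumes planar (x, y) coordinates, and the cell side `cell` and the function names `build_grid` and `places_within` are illustrative choices.

```python
from collections import defaultdict
from math import ceil, floor, hypot

def build_grid(places, cell):
    """Bucket place indices by the grid cell that contains them."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(places):
        grid[(floor(x / cell), floor(y / cell))].append(idx)
    return grid

def places_within(places, grid, cell, i, max_d):
    """Indices of all other places within distance max_d of places[i]."""
    x, y = places[i]
    cx, cy = floor(x / cell), floor(y / cell)
    reach = ceil(max_d / cell)  # number of cells to look at in each direction
    found = []
    for gx in range(cx - reach, cx + reach + 1):
        for gy in range(cy - reach, cy + reach + 1):
            # .get avoids creating empty buckets in the defaultdict
            for j in grid.get((gx, gy), ()):
                qx, qy = places[j]
                if j != i and hypot(x - qx, y - qy) <= max_d:
                    found.append(j)
    return sorted(found)
```

The grid is built once in a single pass over the places, after which each query only inspects the cells around the query place instead of descending a tree.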
DCPGS-G: Grid-based
Results
Visualization-based Analysis
Social Entropy based Evaluation
CommuteTime and Katz have the lowest social entropy; however, these methods produce small clusters and too many outliers.
Jaccard also has low social entropy for the same reason.
DCPGS is better than SimRank.