ST-COPOT---Spatial Temporal Clustering with Contour Polygon Trees


1 ST-COPOT---Spatial Temporal Clustering with Contour Polygon Trees
Talk Outline
Interesting Regions
ST-COPOT---Spatial Temporal Clustering with Contour Polygon Trees
A Graph-based Interestingness Hotspot Discovery Framework
Conclusion

2 Interestingness Hotspot Discovery
Research Overview
Interestingness Hotspot Discovery
Goal: Find hotspots in spatial datasets that maximize an externally given interestingness function based on a domain expert's notion of interestingness.
Interestingness function: measures the "news-worthiness" of a spatial region. It is defined on the spatial and non-spatial attributes of the data and can be evaluated on any set of objects.
Interestingness hotspot: a contiguous region that maximizes the given interestingness function.
Why: Can identify hotspots that other methods cannot. Efficient, generic, extensible plugin architecture.

3 Research Overview Problem Definition
Given:
1. Dataset $O$
2. Neighborhood relation $N \subseteq O \times O$
3. Interestingness measure $i: 2^O \to \{0\} \cup \mathbb{R}^+$
The goal of this research is to develop frameworks and algorithms that find interestingness hotspots $H \subseteq O$. $H$ is an interestingness hotspot with respect to $i$ if the following 2 conditions are met:
1. $i(H) \ge \theta$, for an interestingness threshold $\theta$.
2. $H$ is contiguous with respect to $N$; that is, for each pair of objects $(o, v)$ with $o, v \in H$, there has to be a path from $o$ to $v$ that traverses neighboring objects (w.r.t. $N$) belonging to $H$.
In summary, interestingness hotspots $H$ are contiguous regions in space that are interesting ($i(H) \ge \theta$).
Moreover, our framework is also capable of finding spatial-temporal hotspots in gridded spatial-temporal datasets. Temporal attributes of grid cells are used for identifying neighboring grid cells in the time dimension. A sample neighborhood definition is given in Section 5.1 for spatial-temporal grids.
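The contiguity condition can be checked with a breadth-first search restricted to the candidate region. The sketch below is only an illustration: the function name `is_contiguous`, the adjacency-dictionary representation of N, and the choice to treat an empty set as non-contiguous are assumptions, not part of the presentation.

```python
from collections import deque

def is_contiguous(hotspot, neighbors):
    """Return True if every pair of objects in `hotspot` is connected by a
    path of neighboring objects that stays inside `hotspot`.

    `neighbors` maps each object to an iterable of its neighbors (relation N).
    """
    hotspot = set(hotspot)
    if not hotspot:
        return False  # assumption: an empty set is not a hotspot
    start = next(iter(hotspot))
    seen = {start}
    queue = deque([start])
    while queue:
        o = queue.popleft()
        for v in neighbors.get(o, ()):
            # Only follow neighbors that belong to the candidate region.
            if v in hotspot and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == hotspot
```

For a path 1-2-3, the set {1, 2, 3} is contiguous, while {1, 3} is not, because the only path between 1 and 3 leaves the set.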

4 Sample Interestingness Functions
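Research Overview
Sample Interestingness Functions
Correlation:
$$i_{corr(p_1,p_2)}(H) = \begin{cases} 0, & \text{if } |correlation(H, p_1, p_2)| < \theta \\ |correlation(H, p_1, p_2)| - \theta, & \text{otherwise} \end{cases}$$
Variance:
$$i_{var(p)}(H) = \begin{cases} \theta - variance(H, p), & \text{if } variance(H, p) < \theta \\ 0, & \text{otherwise} \end{cases}$$
Purity:
$$i_{pur}(H) = \begin{cases} 0, & \text{if } \max_t prop_t(H) < \theta \\ \max_t prop_t(H) - \theta, & \text{otherwise} \end{cases}$$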
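The correlation and purity measures share one pattern: compute a statistic on the region, return 0 below the threshold θ, and return the excess over θ otherwise. A minimal sketch of that pattern follows. The function names, the representation of objects as dictionaries of attributes, and the use of the absolute Pearson correlation are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 for constant inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def i_corr(region, p1, p2, theta):
    """Excess of |correlation(p1, p2)| over theta, or 0 below the threshold."""
    r = abs(pearson([o[p1] for o in region], [o[p2] for o in region]))
    return 0.0 if r < theta else r - theta

def i_pur(region, label, theta):
    """Excess of the dominant class proportion over theta, or 0 below it."""
    counts = Counter(o[label] for o in region)
    top = max(counts.values()) / len(region)
    return 0.0 if top < theta else top - theta
```

Because both functions take the region as a plain list of objects, either one could be passed as the plugin interestingness function in the later phases.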

5 Variance hotspot in a 4D gridded air pollution dataset
Research Overview
Sample Hotspots
Figure: Earthquake dataset, correlation of earthquake depth and magnitude. Green hotspot: negative correlation. Orange hotspot: positive correlation.
Figure: Variance hotspot in a 4D gridded air pollution dataset.

6 Interestingness Hotspot Discovery: Phases
Research Overview
Interestingness Hotspot Discovery: Phases
1. Create graphs: identify neighboring objects
2. Identify hotspot seeds
3. Grow hotspot seeds (based on a neighborhood graph)
4. Remove highly overlapping hotspots
5. Find the scope of each hotspot

7 Main Contributions of this Work
Research Overview
Main Contributions of this Work
1. Novel hotspot growing algorithms
2. Graph-based post-processing algorithm
3. Graph simplification algorithm that makes contribution 2 faster
4. Agglomerative seed merge algorithm
5. Voronoi-based polygon models describing the scope of spatial clusters

8 Spatial Scan Statistics
Motivation
Spatial Scan Statistics
Most popular method
Finds regions with a statistically high number of events compared to the rest of the dataset
Statistical tests applied:
  Likelihood ratio assigned to each region
  Monte Carlo simulations employed
Optimal for circular regions
Extensions exist for other shapes:
  Pyramid-shaped
  Rectangular
  Ring-shaped
  Route networks, linear hotspots

9 Spatial Scan Statistics – Cons
Related Work
Spatial Scan Statistics – Cons
Very slow; limited to a few thousand records
Works with point-based datasets only
Has difficulty dealing with outliers
Restricts the shape of the hotspots found
Cannot use an interestingness function:
  Correlation, variance, etc. cannot be used.
  Can be used only if interestingness is an attribute of a single object and does not depend on other points.

10 Methodology Architectural Diagram of the Framework

11 Neighborhood Definition
Methodology
Neighborhood Definition
Gridded and polygonal datasets: objects are neighbors if they share edges (or points).
Point-based datasets: not trivial. Gabriel graphs are used.
Figure: Popular proximity graph types for a dog-shaped dataset

12 Diameter circle and its relation to Gabriel neighborhood
Methodology
Gabriel Graphs
Two distinct points a and b in a point set are adjacent in the Gabriel graph if the closed disc d, of which the line segment ab is a diameter, contains no other points.
|E| ≤ 3 × |V|
Figure: Diameter circle and its relation to Gabriel neighborhood
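The adjacency rule can be written down directly. A point c lies in the closed disc with diameter ab exactly when the angle at c is at least 90 degrees, that is, when (a − c)·(b − c) ≤ 0. The brute-force sketch below tests every pair against every other point, so it runs in O(n³) and is meant only to make the definition concrete. It is not how the framework computes the graph, and the function name and the assumption of distinct 2D points are illustrative.

```python
from itertools import combinations

def gabriel_edges(points):
    """Return index pairs (i, j) that are adjacent in the Gabriel graph.

    `points` is a list of distinct (x, y) tuples. Brute force, O(n^3).
    """
    edges = []
    for i, j in combinations(range(len(points)), 2):
        (ax, ay), (bx, by) = points[i], points[j]
        blocked = False
        for k, (cx, cy) in enumerate(points):
            if k == i or k == j:
                continue
            # c is inside or on the diameter circle iff (a - c) . (b - c) <= 0
            if (ax - cx) * (bx - cx) + (ay - cy) * (by - cy) <= 0:
                blocked = True
                break
        if not blocked:
            edges.append((i, j))
    return edges
```

With points (0, 0), (2, 0) and (1, 0.5), the third point lies inside the disc spanned by the first two, so that pair is not adjacent, while both pairs involving the third point are.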

13 Phase 1. Identify Neighboring Objects
Methodology
Phase 1. Identify Neighboring Objects
Polygonal datasets: polygons are neighbors if they share an edge.
Gridded datasets: grid cells are neighbors if they share an edge.
Point-based datasets: points are neighbors if they are adjacent in the Gabriel graph.

14 Phase 2. Identify Hotspot Seeds
Methodology
Phase 2. Identify Hotspot Seeds
Polygonal and point-based datasets:
  Create a region for each object, consisting of the object and its 1st degree neighbors
  1 seed around each object
  Highly overlapping seeds
Gridded datasets:
  Divide the whole dataset into smaller rectangular regions of the same size
Many seeds can be merged: more efficiency
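Both seeding strategies are simple to sketch. The helper names below are hypothetical, objects are assumed to be hashable identifiers, and grid cells are assumed to be addressed as (row, column) pairs. Blocks at the border are truncated when the grid size is not a multiple of the block size, which is a choice made here rather than something the slides specify.

```python
def neighborhood_seeds(neighbors):
    """One seed per object: the object plus its 1st degree neighbors."""
    return {o: frozenset([o, *nbrs]) for o, nbrs in neighbors.items()}

def grid_seeds(n_rows, n_cols, block_rows, block_cols):
    """Tile an n_rows x n_cols grid into rectangular blocks of cells."""
    seeds = []
    for r0 in range(0, n_rows, block_rows):
        for c0 in range(0, n_cols, block_cols):
            seeds.append(frozenset(
                (r, c)
                for r in range(r0, min(r0 + block_rows, n_rows))
                for c in range(c0, min(c0 + block_cols, n_cols))
            ))
    return seeds
```

In practice only seeds whose interestingness exceeds a threshold would be kept; that filtering step is omitted here.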

15 Phase 2.1. Merge Seeds using an agglomerative algorithm
Methodology
Phase 2.1. Merge Seeds using an agglomerative algorithm
Create a neighborhood graph of all seeds.
Merge neighboring seeds as long as a merge is acceptable and there are merge candidates left.
Recalculate merge candidates and the neighborhood graph after each merge.
Give precedence to merge candidates with a larger reward increase.
Acceptance criterion: merge(s1, s2) if R(s1 ∪ s2) > (R(s1) + R(s2)) × µ
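A naive version of this merge loop is sketched below. It recomputes all candidates after every merge, as the slide describes, but makes several assumptions: seeds are sets of objects, two seeds count as neighbors if they share an object or contain neighboring objects, and "reward increase" is taken to be R(s1 ∪ s2) − (R(s1) + R(s2)). The function name and signature are hypothetical.

```python
def merge_seeds(seeds, neighbors, reward, mu):
    """Greedily merge neighboring seeds while an acceptable merge exists."""
    seeds = [frozenset(s) for s in seeds]

    def adjacent(s1, s2):
        # Seeds are neighbors if they overlap or touch via the relation N.
        return bool(s1 & s2) or any(
            v in s2 for o in s1 for v in neighbors.get(o, ()))

    while True:
        best = None  # (gain, i, j, merged)
        for i in range(len(seeds)):
            for j in range(i + 1, len(seeds)):
                s1, s2 = seeds[i], seeds[j]
                if not adjacent(s1, s2):
                    continue
                merged = s1 | s2
                base = reward(s1) + reward(s2)
                r = reward(merged)
                # Acceptance criterion: R(s1 U s2) > (R(s1) + R(s2)) * mu
                if r > base * mu:
                    gain = r - base
                    if best is None or gain > best[0]:
                        best = (gain, i, j, merged)
        if best is None:
            return seeds
        _, i, j, merged = best
        seeds = [s for k, s in enumerate(seeds) if k not in (i, j)]
        seeds.append(merged)
```

With a super-additive reward such as the squared region size, every neighboring pair is worth merging and a connected dataset collapses into a single seed; with a purely additive reward and µ = 1, nothing is merged.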

16 Phase 3. Growing Hotspot Seeds - Baseline Algorithm
Methodology
Phase 3. Growing Hotspot Seeds - Baseline Algorithm
Add the neighbor which increases the hotspot reward most when added
Keep neighbors in a hash set
Continue as long as interestingness is positive
Keep a reference to the best reward value
Output the hotspot state with the best reward
O(n²) runtime complexity (with incremental reward calculation)
Figure: Reward change when growing
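The baseline loop can be sketched as follows. The signature is hypothetical: `interestingness` and `reward` are plugin callables that take a set of objects, and `neighbors` is an adjacency dictionary. For brevity the sketch recomputes the reward from scratch for each candidate, so it does not reach the O(n²) bound quoted above, which relies on incremental reward calculation.

```python
def grow_baseline(seed, neighbors, interestingness, reward):
    """Grow a seed by repeatedly adding the neighbor that maximizes reward.

    Returns the best (region, reward) state seen while growing.
    """
    region = set(seed)
    # Hash set of objects adjacent to the region but not yet in it.
    frontier = {v for o in region for v in neighbors.get(o, ())
                if v not in region}
    best_region, best_reward = set(region), reward(region)
    while frontier:
        best_neighbor = max(frontier, key=lambda v: reward(region | {v}))
        region.add(best_neighbor)
        frontier.discard(best_neighbor)
        frontier.update(v for v in neighbors.get(best_neighbor, ())
                        if v not in region)
        if interestingness(region) <= 0:
            break  # stop once the region is no longer interesting
        current = reward(region)
        if current > best_reward:
            best_region, best_reward = set(region), current
    return best_region, best_reward
```

Because the best state is remembered separately, the region may pass through a temporary dip in reward and still return a larger, better hotspot later.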

17 Figure 10. Change of reward value of a growing hotspot
Methodology
Hash set
A very efficient implementation for holding a collection of objects
Uses a hash function to keep a reference to objects
O(1) runtime complexity for:
  Add
  Remove
  Contains
Figure: Hash set representation
Figure 10. Change of reward value of a growing hotspot

18 Phase 3. Growing Hotspot Seeds - Heap-based algorithm
Methodology
Phase 3. Growing Hotspot Seeds - Heap-based algorithm
Assign a fitness value to each neighbor when it is first encountered while growing
Keep neighbors in a max-heap (priority queue) and a hash set
Add the neighbor at the root of the heap in each step
O(n log n) complexity
Continue as long as interestingness is positive
Keep a reference to the best reward value
Output the hotspot state with the best reward
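A sketch of the heap-based variant, under stated assumptions: `fitness(v, region)` is a hypothetical callable evaluated once, when neighbor `v` is first encountered; `interestingness` and `reward` take a set of objects. Python's `heapq` is a min-heap, so fitness values are negated, and a counter breaks ties so that objects themselves never need to be comparable.

```python
import heapq
from itertools import count

def grow_with_heap(seed, neighbors, fitness, interestingness, reward):
    """Grow a seed by always adding the queued neighbor with highest fitness."""
    region = set(seed)
    queued = set(region)  # hash set: objects in the region or already queued
    heap = []
    tie = count()

    def push_neighbors(o):
        for v in neighbors.get(o, ()):
            if v not in queued:
                queued.add(v)
                # Fitness is fixed at first encounter and never updated.
                heapq.heappush(heap, (-fitness(v, region), next(tie), v))

    for o in list(region):
        push_neighbors(o)

    best_region, best_reward = set(region), reward(region)
    while heap:
        _, _, v = heapq.heappop(heap)  # root of the max-heap
        region.add(v)
        if interestingness(region) <= 0:
            break
        current = reward(region)
        if current > best_reward:
            best_region, best_reward = set(region), current
        push_neighbors(v)
    return best_region, best_reward
```

Each object is pushed and popped at most once, which is where the O(n log n) bound comes from; the trade-off is that a fitness value assigned early may be stale by the time the object is added.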

19 Heap (Priority Queue)
Methodology
An efficient implementation of a priority queue data structure
Heap property: see figure
O(1) runtime complexity for find-max
Insert-node: O(1) on average, O(log n) in the worst case for a binary heap
O(log n) for the delete-max operation
O(n) for find-node
Figure: A max-heap
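For reference, these operations map onto Python's standard `heapq` module, which implements a binary min-heap on a list. The small wrapper below is an illustrative assumption, not part of the framework; it negates priorities to get max-heap behavior and assumes that items with equal priority are comparable.

```python
import heapq

class MaxHeap:
    """Minimal max-heap wrapper around heapq (a binary min-heap)."""

    def __init__(self):
        self._items = []

    def push(self, priority, item):
        # Negate the priority so the largest value sits at the root.
        heapq.heappush(self._items, (-priority, item))

    def peek(self):
        """find-max: read the root without removing it, O(1)."""
        priority, item = self._items[0]
        return -priority, item

    def pop(self):
        """delete-max: remove and return the root, O(log n)."""
        priority, item = heapq.heappop(self._items)
        return -priority, item

    def __contains__(self, item):
        """find-node: linear scan, O(n)."""
        return any(existing == item for _, existing in self._items)

    def __len__(self):
        return len(self._items)
```

The O(n) membership test is the reason the growing algorithm pairs the heap with a hash set, which answers the same question in O(1).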

20 Phase 4. Post-processing: Remove overlapping hotspots
Methodology
Phase 4. Post-processing: Remove overlapping hotspots
Many of the hotspots overlap to a large degree.
Need to eliminate highly overlapping, low-quality hotspots.
Input: a set of hotspots $S$ and an overlap threshold $\lambda$ where $0 \le \lambda < 1$.
Problem: Find a subset $S' \subseteq S$ for which $\sum_{H \in S'} Reward(H)$ is maximal, where $Reward(H)$ is the reward of hotspot $H$, subject to the following constraint:
$$\forall H_1 \in S', \forall H_2 \in S': \; H_1 \ne H_2 \Rightarrow overlap(H_1, H_2) \le \lambda$$
where
$$overlap(H_1, H_2) = \frac{|H_1 \cap H_2|}{\min(|H_1|, |H_2|)}$$
Solution: Reduce this problem to finding a maximum weight independent set in the overlap graph of hotspots.
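The overlap measure and the resulting overlap graph are easy to compute when hotspots are stored as sets of object identifiers. The sketch below assumes non-empty hotspots kept in a dictionary keyed by a sortable name; both function names are hypothetical.

```python
from itertools import combinations

def overlap(h1, h2):
    """|H1 ∩ H2| / min(|H1|, |H2|) for two non-empty hotspots."""
    h1, h2 = set(h1), set(h2)
    return len(h1 & h2) / min(len(h1), len(h2))

def overlap_graph_edges(hotspots, lam):
    """Edges between hotspots whose overlap exceeds the threshold lam."""
    return [(a, b) for a, b in combinations(sorted(hotspots), 2)
            if overlap(hotspots[a], hotspots[b]) > lam]
```

Dividing by the smaller hotspot means a small hotspot fully contained in a large one has overlap 1, so the pair always conflicts regardless of how large the outer hotspot is.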

21 Phase 4. Post-processing: Remove overlapping hotspots
Methodology
Phase 4. Post-processing: Remove overlapping hotspots
1. Calculate the reward of each hotspot using the plugin reward function, or use the existing reward values already calculated.
2. Create a weighted overlap graph of hotspots in which the weight of each vertex is the reward of the hotspot, and there is an edge between two vertices if their degree of overlap is more than the overlap threshold λ.
3. Simplify the overlap graph by eliminating vertices that cannot be in the optimal solution.
4. Find the connected components in the simplified overlap graph.
5. For each connected component Ci, create the complement graph Ci'.
6. Find the maximum weight clique (MWC) in each complement graph Ci'.
7. The union of all vertices in the MWCs is the optimal solution.
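A maximum weight clique in the complement of a component is the same vertex set as a maximum weight independent set in the component itself, so steps 4 to 7 can be illustrated without building the complement graph. The sketch below substitutes exhaustive search for a real MWC solver. It is exponential in the component size, assumes positive weights, and uses hypothetical function names, so it is suitable only for tiny examples.

```python
from itertools import combinations

def connected_components(vertices, edges):
    """Return (list of components, adjacency dict) for an undirected graph."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for v in vertices:
        if v in seen:
            continue
        component, stack = [], [v]
        seen.add(v)
        while stack:
            u = stack.pop()
            component.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    return components, adj

def select_hotspots(weights, edges):
    """Max weight independent set, solved per component by brute force."""
    components, adj = connected_components(list(weights), edges)
    chosen = set()
    for component in components:
        best, best_weight = (), 0.0
        for size in range(1, len(component) + 1):
            for subset in combinations(component, size):
                # Skip subsets containing two overlapping hotspots.
                if any(b in adj[a] for a, b in combinations(subset, 2)):
                    continue
                weight = sum(weights[v] for v in subset)
                if weight > best_weight:
                    best, best_weight = subset, weight
        chosen.update(best)
    return chosen
```

Solving each connected component separately is what keeps the problem tractable: the search cost depends on the largest component rather than on the whole graph.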

22 Phase 4. Post-processing: Remove overlapping hotspots
Methodology
Phase 4. Post-processing: Remove overlapping hotspots
Figure panels:
1. A set of overlapping hotspots
2. All overlaps
3. Overlap graph
4. Simplified graph
5. Complement graph
6. Max weight clique

23 Phase 4. Post-processing: Graph Simplification Algorithm
Methodology
Phase 4. Post-processing: Graph Simplification Algorithm
1. For each vertex, create a Set data structure and put the vertex itself and all of its adjacent vertices into the set. This set is called the "overlap set" of the vertex.
2. Compare each vertex's overlap set with the overlap sets of the vertices it is connected to. Add a vertex to a "removal set" if its weight is lower than that of a vertex with the same overlap set.
3. Remove the vertices in the removal set from the overlap graph G.
Significant improvement on the MWC algorithm: from 30 hours to 1 second on a graph with tens of thousands of edges.
Takes less than 1 second for thousands of vertices and edges.
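The three steps translate almost line for line into code. In this sketch the function name, the dictionary of vertex weights, and the edge-list representation are assumptions. Vertices with equal weight and equal overlap sets are both kept, since the rule only removes a vertex whose weight is strictly lower.

```python
def simplify_overlap_graph(weights, edges):
    """Drop vertices dominated by a heavier neighbor with the same overlap set.

    Returns (remaining weights, remaining edges).
    """
    adj = {v: set() for v in weights}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Step 1: overlap set = the vertex itself plus all adjacent vertices.
    overlap_set = {v: adj[v] | {v} for v in weights}
    # Step 2: compare each vertex only with the vertices it is connected to.
    removal = {
        v for v in weights for u in adj[v]
        if overlap_set[u] == overlap_set[v] and weights[v] < weights[u]
    }
    # Step 3: remove the collected vertices and their edges.
    kept = {v: w for v, w in weights.items() if v not in removal}
    kept_edges = [(u, v) for u, v in edges
                  if u not in removal and v not in removal]
    return kept, kept_edges
```

The removal is safe because two adjacent vertices with identical overlap sets conflict with exactly the same hotspots, so any solution containing the lighter one can be improved by swapping in the heavier one.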

24 Phase 5. Find the Scope of Hotspots
Methodology
Phase 5. Find the Scope of Hotspots
Polygonal and gridded hotspots: merge polygons or grid cells to create a scope.
Point-based datasets: create a polygonal boundary for the hotspot.
Approach 1: Use Voronoi polygons and the convex hull
  Requires access to the whole dataset
Approach 2: Create tighter boundaries, starting with the Delaunay triangulation and eliminating long edges using a polygon fitness function
  Does not require access to the whole dataset
  Novel polygon emptiness measure
  Finds a balance between polygon emptiness and complexity

25 Methodology Voronoi Diagram Convex Hull (red) Polygon model (purple)

26 Outline
Introduction
Research Overview
Related Work
Methodology
Experimental Evaluation
Conclusion

27 Correlation Hotspots in a Gridded Air Pollution Dataset
Experimental Evaluation
Correlation Hotspots in a Gridded Air Pollution Dataset
Identifying seed regions:
  235 seeds found with |correlation| > 0.95
  Seed merge threshold µ = 0.96
  108 seeds obtained in 0.16 seconds
Figure: A part of the seed neighborhood graph

28 Experimental Evaluation
Comparison to SatScan
Evaluated based on hotspot size and interestingness
New York taxicab dataset: 9626 pick-up and drop-off locations, taxi fare, tolls, tips, pick-up and drop-off times, etc.
Goal: find regions where taxi drivers make more money per minute
$$Rate(o) = \frac{\text{total fare including tips} - \text{total cost of trip}}{\text{duration in minutes}}$$
total cost = gas price × total distance in miles / mpg + total fees
gas price = $2 per gallon, mpg = 20 miles per gallon
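As a worked instance of the rate formula, the sketch below computes the per-minute rate for a single trip. The function and parameter names are hypothetical, and the example trip is made up for illustration; only the formula and the gas price and mpg constants come from the slide.

```python
def rate_per_minute(fare_with_tips, distance_miles, fees, duration_minutes,
                    gas_price=2.0, mpg=20.0):
    """Net earnings per minute: (fare incl. tips - trip cost) / duration."""
    total_cost = gas_price * distance_miles / mpg + fees
    return (fare_with_tips - total_cost) / duration_minutes
```

For a hypothetical 10-mile, 20-minute trip with a 25 dollar fare including tips and 1 dollar in fees, the cost is 2 × 10 / 20 + 1 = 2 dollars, giving a rate of 23 / 20 = 1.15 dollars per minute.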

29 Experimental Evaluation
Taxicab dataset and Voronoi diagram

30 Comparison to SatScan: Framework Results
Experimental Evaluation
Comparison to SatScan: Framework Results
32 non-overlapping hotspots with at least 5 records in each.
The 5 hotspots with more than 10 objects are visualized and reported in the result.

31 Comparison to SatScan: SatScan Results
Experimental Evaluation
Comparison to SatScan: SatScan Results
Without removing outliers, only 2 hotspots are found, with 2 objects in each.
Creates small hotspots with very high interestingness around extreme outliers.
Very sensitive to outliers.
Removed 27 records with rates > $3/min from the dataset and re-ran:
Large hotspots are not interesting!

32 Comparison to SatScan: Discussions
Experimental Evaluation
Comparison to SatScan: Discussions
Our framework detects better interestingness hotspots than SatScan: higher interestingness.
Our framework can choose which areas are interesting and grow the hotspot in that direction, whereas SatScan cannot exclude areas with low interestingness from a circular hotspot.
SatScan is very sensitive to outliers.
Figure: Hotspots detected by SatScan (circular) and our framework (polygons) in the same region

33 Outline
Introduction
Research Overview
Related Work
Methodology
Experimental Evaluation
Conclusion

34 Summary
We developed an agglomerative, graph-based framework for discovering interestingness hotspots in spatial datasets based on a domain expert's notion of interestingness.
To the best of our knowledge, the proposed hotspot discovery algorithm is the only one in the literature that grows hotspots from seed regions using a reward function.
Gabriel graphs are used for neighborhood definition in our current work.
We showed that the proposed framework is capable of identifying a much broader class of hotspots than state-of-the-art approaches.
The proposed framework is very generic and can be used with any dataset in which a neighborhood relation between spatial objects can be defined.

