DMiST: Data Mining in Spatio-Temporal Sets
Input
Number of time steps = T
Example: T = 9 (t = 0, 1, …, 8)
Entity: (x1,y1), (x2,y2), …, (x9,y9)
Input
Number of entities/animals/items = n
Example: n = 4 and T = 11
I_1: (x^1_1, y^1_1), …, (x^1_T, y^1_T)
I_2: (x^2_1, y^2_1), …, (x^2_T, y^2_T)
…
I_n: (x^n_1, y^n_1), …, (x^n_T, y^n_T)
[Figure: example trajectories illustrating flock, encounter and convergence]
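A minimal sketch of how this input could be held in memory, assuming each entity is stored as a list of T (x, y) positions. The names `Trajectory`, `make_dataset` and `input_size` are hypothetical, and the coordinates are synthetic placeholders rather than real tracking data.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]  # one entity: T positions, one per time step


def make_dataset(n: int, T: int) -> List[Trajectory]:
    """Build n synthetic trajectories with T time steps each (placeholder data)."""
    return [[(float(i + t), float(i)) for t in range(T)] for i in range(n)]


def input_size(trajectories: List[Trajectory]) -> int:
    """Total number of stored positions, i.e. n * T."""
    return sum(len(tr) for tr in trajectories)


data = make_dataset(n=4, T=11)
print(len(data), len(data[0]), input_size(data))
```

With n = 4 and T = 11 the input consists of nT = 44 positions.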
Example: Caribou Satellite Collar Project, Canada.
Number of caribou = 15.
Time steps = once a week for 8 years.
Input size?
To obtain efficient solutions we need algorithms that scale well, i.e. algorithms whose running time depends only mildly on the input size.
n – number of entities (20 to millions)
T – number of time steps (10 to thousands)
m – size of a flock (2 to 200 entities)
k – flock duration (5 to 50 time steps)
Size of input = nT
Practical algorithms: O((nT)^2)
Fast algorithms: O(nT log nT)
Six basic patterns
1. Encounter: At least m entities pass through a circular region of radius r.
2. Convergence: At least m entities are simultaneously within a circular region of radius r.
3. Flock: At least m entities move together during a time interval of length at least s; at every point in time there is a circular region of radius r that contains all the entities.
4. Recurrences: At least m entities visit a circular region of radius r at least k times.
5. Regular recurrences
6. Concurrent recurrences
Members
NICTA: Joachim Gudmundsson, Thomas Wolle, Ghazi Al-Naymat
DSTO: Brenton Williams, Matthew Lowry
Uni. of Sydney: Sanjay Chawla
Uni. of Queensland: Xiaofang Zhou, Heng Tao Shen, Hoyoung Jeung
Utrecht University: Marc van Kreveld
Members (areas of expertise)
NICTA: Algorithms (apx), Computational Geometry, Data mining
DSTO: Applications, Data mining
Uni. of Sydney: Data mining, Algorithms
Uni. of Queensland: Data base systems, Data mining
Utrecht University: Algorithms, GIS
Approximations
Most problems cannot be solved fast! Instead we need to approximate the solution.
Example: Convergence (radius r is given)
Find all discs of radius r that contain at least m entities.
Two options: approximate the number of entities, or approximate the radius.
[Figure: convergence example with m = 10 and a disc of radius r]
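To illustrate what approximating the radius can mean, here is a minimal sketch for a single time step. It is not the algorithm from the project; it relies only on the observation that if some disc of radius r contains m entities, then a disc of radius 2r centred at any one of them contains all m. The function name `discs_centered_at_entities` and the sample points are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def discs_centered_at_entities(
    points: List[Point], r: float, m: int
) -> List[Tuple[Point, int]]:
    """Return (centre, count) for every entity whose disc of radius 2r
    contains at least m entities (the centre entity itself included).

    Brute force, O(n^2) distance computations for n points.
    """
    hits = []
    for c in points:
        count = sum(1 for p in points if math.dist(c, p) <= 2 * r)
        if count >= m:
            hits.append((c, count))
    return hits


# Three nearby entities and one far away (synthetic positions).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
hits = discs_centered_at_entities(points, r=1.0, m=3)
print(hits)
```

Every reported disc has radius 2r rather than r, which is the price paid for trying only n candidate centres instead of all possible disc positions.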
Convergence
Is there a point that is “covered” by at least m rectangles?
Is there a disc of radius r that intersects at least m lines?
Convergence
Good news: 2-approximation of the number of entities in O(Tn^2/m) time.
Bad news: Cannot be solved exactly faster than ~Tn^2.
Encounter
Is there a disc of radius r that intersects at least m entities at some point in time?
[Figure: entity positions at time steps t1, t2, t3, t4 and a region of diameter 2r]
Encounter - detect
Idea:
- Consider one “cylinder” C with radius 2r.
- Compute the intersections between C and the n-1 paths.
- If more than 7m paths are inside C at any time, report “Encounter”. Total time: O(n log n) per cylinder.
- If not, solve exactly. Observation: the total size of all subsets within C is O(mn). Total time: O(n log n + nm) per cylinder.
Time: O(Tn^2 (log n + m)).
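The quick test in the idea above can be sketched as follows. This is a simplified illustration and not the project's implementation: it assumes positions are only checked at the sampled time steps (the slide treats continuous paths), and it centres the cylinder on one reference entity. The threshold 7m works because a disc of radius 2r can be covered by seven discs of radius r, so more than 7m entities in the large disc force at least m into one small disc. The name `quick_encounter_test` and the sample data are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def quick_encounter_test(
    trajs: List[Trajectory], ref: int, r: float, m: int
) -> Optional[int]:
    """Return the first time step at which more than 7*m entities lie
    within distance 2r of entity `ref`, or None if that never happens.

    None means "undecided": an exact method is still needed.
    """
    for t in range(len(trajs[ref])):
        centre = trajs[ref][t]
        inside = sum(1 for tr in trajs if math.dist(centre, tr[t]) <= 2 * r)
        if inside > 7 * m:
            return t
    return None


# Eight entities: far apart at t = 0, close together at t = 1 (synthetic).
trajs = [[(100.0 * i, 0.0), (0.1 * i, 0.0)] for i in range(8)]
print(quick_encounter_test(trajs, ref=0, r=1.0, m=1))
print(quick_encounter_test(trajs, ref=0, r=1.0, m=2))
```

In the second call only 8 entities are inside the cylinder, which does not exceed 7m = 14, so the quick test is inconclusive and the exact step would have to run.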
Flock - definition
m – flock size
k – flock duration
r – radius of disc
[Figure: a flock at time steps t1, t2, t3, t4]
Flock - Problem
Problem: Find a largest flock.
The problem is NP-hard: it is as hard as MaxClique!
[Figure: reduction from MaxClique, with entities a, b, c, d, e shown at time steps t1 to t5]
Flock – Hardness result
Cannot be approximated in polynomial time within a factor of n^(1-ε) of the optimal (even if we approximate the radius by a factor of 2).
Hopeless?
Flock
Idea: An entity in the time interval [t_1, t_d] corresponds to a point in 2d dimensions.
[Figure: a trajectory over time steps t1 to t7 becomes a point in 14-dimensional Euclidean space]
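A minimal sketch of this transformation, assuming a trajectory is a list of (x, y) positions and the window is given by a start index and a number of time steps. The function name `to_point` and the sample trajectory are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def to_point(traj: List[Point], start: int, steps: int) -> Tuple[float, ...]:
    """Flatten `steps` consecutive positions into one 2*steps-dimensional point:
    (x_start, y_start, x_start+1, y_start+1, ...)."""
    window = traj[start:start + steps]
    if len(window) < steps:
        raise ValueError("window runs past the end of the trajectory")
    return tuple(coord for pos in window for coord in pos)


# Seven time steps give a point in 14-dimensional space (synthetic positions).
traj = [(float(t), 2.0 * t) for t in range(7)]
p = to_point(traj, start=0, steps=7)
print(len(p), p[:4])
```

Coordinates 2j and 2j+1 of the resulting point hold the entity's position at the j-th time step of the window, so a per-time-step distance constraint becomes a constraint on one coordinate pair.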
Flock
Intersection of k (2k-2)-dimensional “cylinders”
[Figure: time steps t1 to t7]
Flock
1. For each i=k to T do
2.   For every entity E in the time interval [t_i, t_(i+k)] do
3.     transform E to a point in 2k-dimensional space
4.   Build a “Skip Quadtree”
5.   For each point do
6.     perform a 2k-dimensional range counting query.
Approximation: 3-approximation of the radius
Total time: O(Tk (n log n + (1.5)^(2k)))
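The loop structure above can be sketched with a brute-force stand-in for the range counting query. This is not the skip quadtree algorithm and does not achieve the stated running time: it replaces the tree with a direct O(n^2 k) scan per window and uses a 2r test around each entity, which yields a 2-approximation of the radius. It also assumes a window of exactly k consecutive time steps. The name `flock_candidates` and the sample data are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def flock_candidates(
    trajs: List[Trajectory], m: int, k: int, r: float
) -> List[Tuple[int, int, List[int]]]:
    """Return (start, ref, members) for every window of k time steps and every
    reference entity `ref` such that at least m entities stay within distance
    2r of `ref` at each step of the window.

    Any flock of radius r containing `ref` is found; every reported group is a
    flock of radius 2r (discs centred on `ref`).
    """
    n, T = len(trajs), len(trajs[0])
    found = []
    for start in range(T - k + 1):
        steps = range(start, start + k)
        for ref in range(n):
            members = [
                j for j in range(n)
                if all(math.dist(trajs[ref][t], trajs[j][t]) <= 2 * r for t in steps)
            ]
            if len(members) >= m:
                found.append((start, ref, members))
    return found


# Three entities moving side by side and one far away (synthetic data).
trajs = [[(float(t), y) for t in range(5)] for y in (0.0, 0.5, 1.0, 50.0)]
found = flock_candidates(trajs, m=3, k=3, r=0.5)
print(len(found), found[0])
```

The same group is reported once per reference entity and window, so a real implementation would deduplicate and merge the results.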
Flock – experimental results
#entities   Flock duration   Time (s)
20K         4                <1
20K         8                67
[Remaining table rows garbled in extraction: 20K K K K K K K166800]
What should be reported?
- Detect if a pattern exists, report.
- Report all patterns.
- Report “largest” pattern.
Current and future research
Advanced patterns
– Regular recurrences
– Hierarchical patterns
– …
Implement practical algorithms
Algorithms and association rule mining
Input data with errors?
External memory algorithms?
Generate test data