1 Parameter-Free Spatial Data Mining Using MDL. S. Papadimitriou, A. Gionis, P. Tsaparas, R.A. Väisänen, H. Mannila, and C. Faloutsos. International Conference on Data Mining 2005

2 Problems:  Finding patterns of spatial correlation and feature co-occurrence:  Automatically, that is, parameter-free.  Simultaneously, that is, both kinds of pattern at once.  For example:  Spatial locations on a grid.  Features correspond to species present in specific cells.  Each (cell, species) entry is 1 if the species is present in that cell and 0 otherwise.  Feature co-occurrence:  Cohabitation of species.  Spatial correlation:  Natural habitats for species.

3 Motivation:  Many applications  Biodiversity Data  As we just demonstrated.  Geographical Data  Presence of facilities on city blocks.  Environmental Data  Occurrence of events (storms, drought, fire, etc.) in various locations.  Historical and Linguistic Data  Occurrence of words in different languages/countries, historical events in a set of locations.  Existing methods either:  Detect one pattern, but not both, or  Require user-input parameters.

4 Background  Minimum Description Length (MDL):  Let L(D|M) denote the code length required to represent data D given model M.  Let L(M) be the code length required to describe the model itself.  The total code length is then:  L(D, M) = L(D|M) + L(M)  This principle was used in SLIQ and is the intuition behind the connection between data mining and data compression.  The best model minimizes L(D, M), giving the shortest total description of the data.  Choosing the best model is a problem in its own right.  This will be explored further in the next paper I present.
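To make the two-part cost concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration and not taken from the paper: the model family (a binary sequence cut into segments, each coded with its own density of ones), the model cost of log2(n + 1) bits per segment, and the names `entropy_bits` and `two_part_cost`.

```python
import math

def entropy_bits(ones, total):
    """Bits to encode `total` binary symbols containing `ones` ones: total * H(p)."""
    if total == 0 or ones == 0 or ones == total:
        return 0.0
    p = ones / total
    return -total * (p * math.log2(p) + (1 - p) * math.log2(1 - p))

def two_part_cost(bits, boundaries):
    """L(D, M) = L(D|M) + L(M) for a toy model made of segments.

    `boundaries` lists the end index of each segment. The model cost is an
    assumption for illustration: log2(n + 1) bits per segment, enough to
    send that segment's count of ones.
    """
    n = len(bits)
    model_cost = len(boundaries) * math.log2(n + 1)
    data_cost, start = 0.0, 0
    for end in boundaries:
        segment = bits[start:end]
        data_cost += entropy_bits(sum(segment), len(segment))
        start = end
    return data_cost + model_cost

data = [0] * 20 + [1] * 20
one_segment = two_part_cost(data, [40])        # simple model, expensive data
two_segments = two_part_cost(data, [20, 40])   # richer model, cheap data
best = min((one_segment, "one segment"), (two_segments, "two segments"))
```

The richer model wins here because the extra model bits are more than repaid by the drop in data bits; MDL picks whichever model gives the smaller sum.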

5 Background  Quadtree Compression  Quadtrees:  Used to index and reason about contiguous, variable-size grid regions (among other applications, mostly spatial).  Used for 2D data; the octree is the 3D analogue, and k-d trees play a similar role in k dimensions.  “Full quadtree”: every node has either 0 or 4 children.  Thus, every internal node corresponds to a partitioning of a rectangular region into 4 subregions.  Each quadtree structure corresponds to a unique partitioning.  Transmission:  If we only care about the structure (the spatial partitioning), we can transmit a 0 for each internal node and a 1 for each leaf, in depth-first order.  If we transmit the leaf values as well, the additional cost is the number of leaves times the entropy of the leaf value distribution.
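The depth-first structure code can be sketched as follows. The nested-list representation (a leaf is a plain value, an internal node is a list of exactly four children) and the function names are assumptions chosen for illustration, not the paper's data structure.

```python
def encode_structure(node):
    """Depth-first structure code: 0 for an internal node, 1 for a leaf."""
    if isinstance(node, list):                 # internal node
        assert len(node) == 4, "a full quadtree node has 0 or 4 children"
        bits = [0]
        for child in node:
            bits.extend(encode_structure(child))
        return bits
    return [1]                                 # leaf

def decode_structure(bits):
    """Rebuild the tree shape from the bit stream; leaves come back as None."""
    stream = iter(bits)

    def read():
        if next(stream) == 1:
            return None
        return [read() for _ in range(4)]

    return read()

# Root with four quadrants; the second quadrant is subdivided once more.
tree = [1, [1, 2, 2, 1], 2, 2]
code = encode_structure(tree)
shape = decode_structure(code)
```

Because every internal node has exactly four children, the receiver always knows how many subtrees to read next, so one bit per node is enough to recover the partitioning.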

6 Example

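7 Quadtree Encoding  Let T be a quadtree with m leaf nodes, of which m_p have value p, 1 ≤ p ≤ k.  The total codelength is one bit per node for the structure plus the entropy cost of the leaf values:  L(T) = (4m - 1)/3 + m H(m_1/m, ..., m_k/m), since a full quadtree with m leaves has (4m - 1)/3 nodes.  If we know the distribution of the leaf values, we can calculate this in constant time.  Updating the tree requires O(log m) time in the worst case, as part of the tree may require pruning.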
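A minimal sketch of the codelength computation, combining the two costs described for quadtree transmission: one bit per node for the structure, plus the number of leaves times the entropy of the leaf values. The nested-list tree representation and the function names are assumptions for illustration; the node count (4m - 1)/3 follows from every internal node having exactly four children.

```python
import math
from collections import Counter

def leaf_values(node):
    """Yield the value stored at every leaf, in depth-first order."""
    if isinstance(node, list):
        for child in node:
            yield from leaf_values(child)
    else:
        yield node

def quadtree_codelength(node):
    """Structure bits (one per node) plus m * H(leaf value distribution)."""
    counts = Counter(leaf_values(node))
    m = sum(counts.values())
    structure_bits = (4 * m - 1) // 3          # nodes in a full quadtree with m leaves
    value_bits = sum(c * math.log2(m / c) for c in counts.values())
    return structure_bits + value_bits

mixed = quadtree_codelength([1, [1, 2, 2, 1], 2, 2])   # 7 leaves, two values
uniform = quadtree_codelength([5, 5, 5, 5])            # 4 leaves, one value
```

Only the leaf count and the per-value counts enter the formula, which is why the cost can be refreshed cheaply once those counts are maintained.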

8 Binary Matrices / Bi-groupings:  Bi-grouping:  Simultaneous grouping of m rows and n columns into k and l disjoint row and column groups.  Let D denote an m x n binary matrix.  The cost of transmitting D follows the MDL principle, L(D) = L(D|M) + L(M), where the model M is a bi-grouping {Q_x, Q_y}.  Lemma (we will skip the proof):  The codelength for transmitting an m-to-k mapping Q_x where m_p symbols are mapped to the value p is approximately m H(m_1/m, ..., m_k/m).
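A rough sketch of how a bi-grouping could be scored. Two things here are assumptions and not stated on the slide: the mapping cost is taken to be m times the entropy of the group-size distribution, and the data cost L(D|M) is taken to be the entropy cost of each (row group, column group) block coded by its density of ones. All function names are hypothetical.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def mapping_cost(assignment, num_groups):
    """Assumed cost of an m-to-k mapping: m * H(m_1/m, ..., m_k/m)."""
    m = len(assignment)
    sizes = [assignment.count(p) for p in range(num_groups)]
    return m * entropy(s / m for s in sizes)

def data_cost(D, row_groups, col_groups, k, l):
    """Assumed L(D|M): every block is coded using its own density of ones."""
    ones = [[0] * l for _ in range(k)]
    cells = [[0] * l for _ in range(k)]
    for i, row in enumerate(D):
        for j, value in enumerate(row):
            ones[row_groups[i]][col_groups[j]] += value
            cells[row_groups[i]][col_groups[j]] += 1
    total = 0.0
    for p in range(k):
        for q in range(l):
            if cells[p][q]:
                rho = ones[p][q] / cells[p][q]
                total += cells[p][q] * entropy([rho, 1 - rho])
    return total

def total_cost(D, row_groups, col_groups, k, l):
    """L(D) = L(D|M) + L(M), with M the pair of row and column mappings."""
    return (data_cost(D, row_groups, col_groups, k, l)
            + mapping_cost(row_groups, k) + mapping_cost(col_groups, l))

D = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
grouped = total_cost(D, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
ungrouped = total_cost(D, [0, 0, 0, 0], [0, 0, 0, 0], 1, 1)
```

With the block-diagonal matrix above, the two-by-two grouping pays 8 bits for the mappings and nothing for the data, while the single-group model pays nothing for the mappings but 16 bits for the data.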

9 Methodology  Exploiting spatial locality:  Bi-grouping as presented is nonspatial!  To make it spatial, assign a non-uniform prior to possible groupings.  That is, adjacent cells are more likely to belong to the same group.  Row groups correspond to spatial groupings.  “Neighborhoods”  “Habitats”  Row groupings should demonstrate spatial coherence.  Column groups correspond to “families”.  “Mountain birds”  “Sea birds”  Intuition  Alternately group rows and columns iteratively until the total cost L(D) stops decreasing.  Finding the global optimum is very expensive.  So our approach will use a greedy search for local optima.

10 Algorithms  INNER:  Group given the number of row and column groups.
Start with an arbitrary bi-grouping of matrix D into k row groups and l column groups.
do {
  For each row i from 1 to m, assign i to the row group p, 1 ≤ p ≤ k, for which the “cost gain” (the decrease in the total cost L(D)) is maximized.
  Repeat for the columns, producing the next bi-grouping.
  t += 2
} while (L(D) is decreasing)
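The alternating search in INNER can be sketched as below. This is a simplified, non-spatial illustration and not the authors' implementation: the cost function (per-block entropy plus an entropy-based mapping cost) is an assumption, the cost is recomputed from scratch for every candidate move instead of being updated incrementally, and the names `total_cost` and `inner` are hypothetical.

```python
import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def total_cost(D, rows, cols, k, l):
    """Assumed non-spatial cost: per-block entropy plus both mapping costs."""
    ones = [[0] * l for _ in range(k)]
    cells = [[0] * l for _ in range(k)]
    for i, row in enumerate(D):
        for j, value in enumerate(row):
            ones[rows[i]][cols[j]] += value
            cells[rows[i]][cols[j]] += 1
    cost = 0.0
    for p in range(k):
        for q in range(l):
            if cells[p][q]:
                rho = ones[p][q] / cells[p][q]
                cost += cells[p][q] * entropy([rho, 1 - rho])
    for groups, g in ((rows, k), (cols, l)):
        n = len(groups)
        cost += n * entropy(groups.count(p) / n for p in range(g))
    return cost

def inner(D, rows, cols, k, l):
    """Greedy alternating reassignment until the total cost stops decreasing."""
    rows, cols = list(rows), list(cols)
    best = total_cost(D, rows, cols, k, l)
    while True:
        for groups, g in ((rows, k), (cols, l)):   # rows first, then columns
            for i in range(len(groups)):
                best_p, best_c = groups[i], total_cost(D, rows, cols, k, l)
                for p in range(g):                 # try every group for item i
                    groups[i] = p
                    c = total_cost(D, rows, cols, k, l)
                    if c < best_c - 1e-12:
                        best_p, best_c = p, c
                groups[i] = best_p                 # keep the cheapest choice
        new = total_cost(D, rows, cols, k, l)
        if new >= best - 1e-12:                    # no further decrease: stop
            return rows, cols, new
        best = new

D = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
rows, cols, cost = inner(D, [0, 1, 0, 1], [0, 0, 1, 1], 2, 2)
```

Each individual move can only lower the cost, so the loop terminates, but only at a local optimum; a different starting grouping may end somewhere else.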

11 Algorithms  OUTER:  Finds the number of row and column groups.
Start with k^0 = l^0 = 1.
Split the row group p* with the maximum per-row entropy, holding the columns fixed: move each row in p* to a new group k^t + 1 iff doing so would decrease the per-row entropy of p*.
Run INNER on the resulting grouping and keep its output.
If the cost does not decrease, return the grouping from before the split.
Otherwise, increment t and repeat.
Finally, perform this again for the columns.
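The row-split step of OUTER might look like the sketch below. It is one interpretation under stated assumptions: per-row entropy of a row group is taken to be its data bits per row (the sum over column groups of the group width times the binary entropy of the block density), and rows are examined one at a time against the rows still remaining in p*. The function names are hypothetical, and the follow-up call to INNER and the cost comparison are left out.

```python
import math

def binary_entropy(rho):
    if rho <= 0.0 or rho >= 1.0:
        return 0.0
    return -(rho * math.log2(rho) + (1 - rho) * math.log2(1 - rho))

def per_row_entropy(D, members, cols, l):
    """Assumed definition: data bits per row for the row group `members`."""
    if not members:
        return 0.0
    bits = 0.0
    for q in range(l):
        col_idx = [j for j, c in enumerate(cols) if c == q]
        cells = len(members) * len(col_idx)
        if cells:
            ones = sum(D[i][j] for i in members for j in col_idx)
            bits += len(col_idx) * binary_entropy(ones / cells)
    return bits

def split_row_group(D, rows, cols, k, l):
    """Return a row assignment with k + 1 groups, columns held fixed."""
    groups = [[i for i, g in enumerate(rows) if g == p] for p in range(k)]
    p_star = max(range(k), key=lambda p: per_row_entropy(D, groups[p], cols, l))
    new_rows = list(rows)
    remaining = list(groups[p_star])
    for i in groups[p_star]:
        without = [r for r in remaining if r != i]
        # Move row i to the new group only if p* becomes cheaper per row.
        if without and (per_row_entropy(D, without, cols, l)
                        < per_row_entropy(D, remaining, cols, l)):
            new_rows[i] = k            # index of the newly created group
            remaining = without
    return new_rows

D = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
split = split_row_group(D, [0, 0, 0, 0], [0, 0, 1, 1], 1, 2)
```

Starting from a single row group, the split peels off the rows that make the group look mixed, leaving two homogeneous groups for INNER to refine.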

12 Complexity  INNER is linear with respect to the number of nonzero elements in D.  Let nnz denote the number of nonzero elements.  Let k be the number of row groups and l be the number of column groups.  Row swaps are performed in the quadtree and take O(log m) time each, where m is the number of cells.  Let T be the number of iterations required to minimize the cost.  O(nnz * (k + l + log m) * T)  OUTER, though quadratic with respect to (k + l), is linear with respect to the dominating term nnz.  Let n be the number of row splits.  O((k + l)^2 nnz + (k + l) n log m)

13 Experiments  NoisyRegions  Three features (“species”) on a 32x32 grid.  So D has 32x32 = 1024 rows.  And 3 columns.  3% of the cells, chosen at random, have a wrong species, also randomly chosen.  The spatial and non-spatial groupings are shown to the right.  Recall: Bi-grouping is not spatial by default.  Spatial grouping reduces the total codelength.  The approach is not quite perfect due to the heuristic nature of the algorithm.

14 Experiments  Birds  219 Finnish bird species over 3813 10x10km habitats.  Species are the features, habitats are cells.  So our matrix is 3813x219.  The spatial grouping is clearly more coherent.  Spatial grouping reveals Boreal zones:  South Boreal: Light Blue and Green.  Mid Boreal: Yellow.  North Boreal: Red.  Outliers are (correctly) grouped alone.  Species with specialized habitats.  Or those reintroduced into the wild.

15 Other approaches  Clustering  k-means  Variants using different estimates of central tendency:  k-medoids, k-harmonic means, spherical k-means, …  Variants determining k based on some criterion:  X-means, G-means, …  BIRCH  CURE  DENCLUE  LIMBO  Also information-theoretic.  These approaches are either lossy, parametric, or not easily adaptable to spatial data.

16 Room for improvement:  Complexity  O(n * log m) cost for reevaluating the quadtree codelength.  O(log m) worst-case time for each reevaluation/row swap * n swaps.  However, the average-case complexity is probably much better.  If we know something about the data distribution, we might be able to reduce this.  Faster convergence  Fewer iterations, reducing the scaling factor T.  Rather than stopping only when there is no decrease in cost, perhaps stop when we fall below a threshold? (Introduces a parameter)  Accuracy  The search will only find local optima, leading to errors.  We can employ some approaches used in annealing or genetic algorithms to attempt to find the global optimum.  Randomly restarting in the search space, for example.  Stochastic gradient descent – similar to what we’re already doing, actually.
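The random-restart idea can be illustrated on a toy one-dimensional landscape standing in for the space of bi-groupings. Everything here is an assumption for illustration: the landscape, the neighbour rule, and the names `local_descent` and `with_random_restarts`. Note that the number of restarts is itself a user-chosen parameter.

```python
import random

def local_descent(values, start):
    """Greedy local search: step to the lower neighbour until none is lower."""
    i = start
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
        best = min(neighbours, key=lambda j: values[j])
        if values[best] >= values[i]:
            return i                   # local optimum reached
        i = best

def with_random_restarts(values, restarts, seed=0):
    """Run the greedy search from random starting points and keep the best result."""
    rng = random.Random(seed)
    results = [local_descent(values, rng.randrange(len(values)))
               for _ in range(restarts)]
    return min(results, key=lambda i: values[i])

# Toy cost landscape with local optima at indices 1, 4, 6 and 8.
landscape = [5, 3, 4, 6, 2, 7, 1, 8, 0, 9]
single_run = local_descent(landscape, 0)
restarted = with_random_restarts(landscape, restarts=50)
```

A single greedy run from index 0 stops at the first dip it meets; restarting from several random points can only match or improve on that result, at the price of extra running time.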

17 Conclusion  Simultaneous and automatic discovery of spatial correlation and feature co-occurrence.  Easy to exploit spatial locality.  Parameter-free.  Utilizes MDL:  Minimizes the sum of the model cost and the data cost given the model.  Efficient.  Almost linear in the number of nonzero entries in the matrix.

18 References
1. S. Papadimitriou, A. Gionis, P. Tsaparas, R.A. Väisänen, H. Mannila, and C. Faloutsos, "Parameter-Free Spatial Data Mining Using MDL", ICDM, Houston, TX, U.S.A., November 27-30, 2005.
2. M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, March 1996.

