Parameter-Free Spatial Data Mining Using MDL. S. Papadimitriou, A. Gionis, P. Tsaparas, R.A. Väisänen, H. Mannila, and C. Faloutsos. International Conference on Data Mining 2005

Problems
- Finding patterns of spatial correlation and feature co-occurrence.
- Automatically:
  - That is, parameter-free.
- Simultaneously.
- For example:
  - Spatial locations on a grid.
  - Features correspond to species present in specific cells.
  - Each (cell, species) pair is 0 or 1, depending on whether that species is present in that cell.
- Feature co-occurrence:
  - Cohabitation of species.
- Spatial correlation:
  - Natural habitats for species.

Motivation
- Many applications:
  - Biodiversity data: as just demonstrated.
  - Geographical data: presence of facilities on city blocks.
  - Environmental data: occurrence of events (storms, drought, fire, etc.) in various locations.
  - Historical and linguistic data: occurrence of words in different languages/countries, or of historical events in a set of locations.
- Existing methods either:
  - detect one kind of pattern, but not both, or
  - require user-input parameters.

Background: Minimum Description Length (MDL)
- Let L(D|M) denote the code length required to represent data D given (using) model M, and let L(M) be the code length required to describe the model itself.
- The total code length is then:
  L(D, M) = L(D|M) + L(M)
- This principle was used in SLIQ and is the intuitive notion behind the connection between data mining and data compression.
- The best model minimizes L(D, M), resulting in optimal compression.
- Choosing the best model is a problem in its own right; it will be explored further in the next paper I present.
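To make the two-part score concrete, here is a minimal Python sketch (my own illustration, not from the paper): a toy model family that cuts a binary sequence into equal segments, charging each segment its empirical entropy for L(D|M) and a conventional log2(n) bits per parameter for L(M). The function names and the parameter-cost convention are assumptions.

```python
import math

def entropy_bits(counts):
    """Shannon entropy (bits per symbol) of an empirical distribution."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def two_part_codelength(data, num_segments):
    """L(D, M) = L(D|M) + L(M) for a toy model that cuts `data` into
    `num_segments` equal pieces, each coded at its own entropy."""
    n = len(data)
    seg = n // num_segments
    data_bits = 0.0  # L(D|M)
    for g in range(num_segments):
        chunk = data[g * seg:(g + 1) * seg]
        data_bits += len(chunk) * entropy_bits([chunk.count(0), chunk.count(1)])
    model_bits = num_segments * math.log2(n)  # L(M): ~log2(n) bits per parameter
    return data_bits + model_bits

data = [0] * 50 + [1] * 50
# More segments shrink L(D|M) but grow L(M); MDL picks the trade-off.
best_k = min(range(1, 11), key=lambda k: two_part_codelength(data, k))
print(best_k)  # 2
```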

Background: Quadtree Compression
- Quadtrees:
  - Used to index and reason about contiguous, variable-size grid regions (among other, mostly spatial, applications).
  - Used for 2D data; the 3D analogue is the octree (in general, a 2^k-ary tree in k dimensions).
  - "Full quadtree": all nodes have either 0 or 4 children.
  - Thus, all internal nodes correspond to a partitioning of a rectangular region into 4 subregions.
  - Each quadtree's structure corresponds to a unique partitioning.
- Transmission:
  - If we only care about the structure (the spatial partitioning), we can transmit a 0 for each internal node and a 1 for each leaf, in depth-first order (see the sketch below).
  - If we transmit the values as well, the additional cost is the number of leaves times the entropy of the leaf-value distribution.
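A small runnable sketch of the depth-first structure code just described; the QuadNode class and its child ordering are illustrative choices of mine, not from the paper.

```python
class QuadNode:
    """Full-quadtree node: either a leaf carrying a value,
    or an internal node with exactly four children."""
    def __init__(self, value=None, children=None):
        self.value = value        # set on leaves
        self.children = children  # list of 4 QuadNodes on internal nodes

def encode_structure(node, bits=None):
    """Depth-first structure code: 0 = internal node, 1 = leaf."""
    if bits is None:
        bits = []
    if node.children is None:
        bits.append(1)
    else:
        bits.append(0)
        for child in node.children:
            encode_structure(child, bits)
    return bits

# A region split once into four quadrants: one internal node, four leaves.
root = QuadNode(children=[QuadNode(value=v) for v in (0, 0, 1, 1)])
print(encode_structure(root))  # [0, 1, 1, 1, 1]
```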

Example

Quadtree Encoding
- Let T be a quadtree with m leaf nodes, of which m_p have value p.
- The total codelength is the structure bits plus the value bits:
  L(T) = (4m - 1)/3 + Σ_p m_p log2(m / m_p)
  (one structure bit per node, since a full quadtree with m leaves has (4m - 1)/3 nodes, plus m times the entropy of the leaf-value distribution; see the sketch below).
- If we know the distribution of the leaf values, we can calculate this in constant time.
- Updating the tree requires O(log n) time in the worst case for an n-cell grid, as part of the tree may require pruning.
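A sketch of this computation under the reconstruction above (one structure bit per node, where a full quadtree with m leaves has (4m - 1)/3 nodes); quadtree_codelength and its dict input format are hypothetical names of mine.

```python
import math

def quadtree_codelength(leaf_counts):
    """Codelength of a full quadtree: one structure bit per node plus
    m * H(leaf values). leaf_counts[p] is m_p, the number of leaves
    with value p; m must be 1 mod 3 for a full quadtree."""
    m = sum(leaf_counts.values())
    structure_bits = (4 * m - 1) / 3  # total node count of a full quadtree
    value_bits = sum(mp * math.log2(m / mp)
                     for mp in leaf_counts.values() if mp > 0)
    return structure_bits + value_bits

print(quadtree_codelength({0: 7, 1: 9}))  # 21 structure + ~15.8 value bits
```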

Binary Matrices / Bi-groupings
- Bi-grouping:
  - Simultaneous grouping of m rows and n columns into k and l disjoint row and column groups, respectively.
- Let D denote an m × n binary matrix.
- The cost of transmitting D is given as follows:
  - Recall the MDL principle: L(D, M) = L(D|M) + L(M).
  - Let {Q_x, Q_y} be a bi-grouping.
- Lemma (we will skip the proof):
  - The codelength for transmitting an m-to-k mapping Q_x, where m_p symbols are mapped to the value p, is approximately:
    L(Q_x) ≈ Σ_p m_p log2(m / m_p)
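Since the Lemma's approximation is just m times the entropy of the group-size distribution, it takes only a few lines to compute; mapping_codelength is a hypothetical helper name.

```python
import math

def mapping_codelength(group_sizes):
    """Approximate codelength of an m-to-k mapping where group_sizes[p]
    symbols map to group p: sum_p m_p * log2(m / m_p)."""
    m = sum(group_sizes)
    return sum(mp * math.log2(m / mp) for mp in group_sizes if mp > 0)

# 1024 rows split into groups of 512, 256, and 256:
print(mapping_codelength([512, 256, 256]))  # 1536.0 bits
```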

Methodology
- Exploiting spatial locality:
  - Bi-grouping as presented is nonspatial!
  - To make it spatial, assign a non-uniform prior to possible groupings: adjacent cells are more likely to belong to the same group (see the codelength comparison below).
- Row groups correspond to spatial groupings:
  - "Neighborhoods", "habitats".
  - Row groupings should demonstrate spatial coherence.
- Column groups correspond to "families":
  - "Mountain birds", "sea birds".
- Intuition:
  - Alternately group rows and columns iteratively until the total cost L(D) stops decreasing.
  - Finding the global optimum is very expensive, so our approach uses a greedy search for local optima.
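To see why the spatial prior pays off, compare the two hypothetical helpers sketched above on the same 32×32 row-group map: coded nonspatially via the Lemma versus as a quadtree, assuming the two groups align exactly with the left and right halves of the grid (so the map collapses to a quadtree with four quadrant leaves).

```python
# Nonspatial code for 1024 cells split evenly into two groups:
print(mapping_codelength([512, 512]))     # 1024.0 bits, ignores locality

# Spatial code: left half = group 0 (NW + SW quadrants), right half =
# group 1 (NE + SE), i.e. a full quadtree with only 4 leaves:
print(quadtree_codelength({0: 2, 1: 2}))  # 5 structure + 4 value bits = 9.0
```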

Algorithms
- INNER: bi-groups the matrix, given the number of row and column groups (see the Python sketch below).

  Start with an arbitrary bi-grouping of matrix D into k row groups and l column groups.
  do {
      For each row i from 1 to m, move i to the row group p (1 ≤ p ≤ k) for which the "cost gain" ΔL(D) is maximized.
      Repeat for the columns, producing the next bi-grouping.
      t += 2
  } while (L(D) is decreasing)
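Below is a runnable Python sketch in the spirit of INNER. It is a deliberate simplification, not the paper's implementation: it recomputes the per-block entropy cost from scratch for every candidate move (the paper maintains counts incrementally to stay linear in nnz), it ignores the model cost and the spatial quadtree prior, and all names are mine.

```python
import math

def block_bits(ones, total):
    """Bits to code a binary block of `total` cells containing `ones`
    ones, at its empirical entropy."""
    if total == 0 or ones in (0, total):
        return 0.0
    p = ones / total
    return total * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

def total_cost(D, rows, cols, k, l):
    """L(D | bi-grouping): sum of per-block entropy costs."""
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for i, row in enumerate(D):
        for j, v in enumerate(row):
            ones[rows[i]][cols[j]] += v
            size[rows[i]][cols[j]] += 1
    return sum(block_bits(ones[p][q], size[p][q])
               for p in range(k) for q in range(l))

def inner(D, k, l, max_sweeps=20):
    """Alternately reassign each row, then each column, to its
    cost-minimizing group until the codelength stops decreasing."""
    m, n = len(D), len(D[0])
    rows = [i % k for i in range(m)]   # balanced starting bi-grouping
    cols = [j % l for j in range(n)]
    cost = total_cost(D, rows, cols, k, l)
    for _ in range(max_sweeps):
        for i in range(m):             # rows, with columns held fixed
            rows[i] = min(range(k), key=lambda p: total_cost(
                D, rows[:i] + [p] + rows[i + 1:], cols, k, l))
        for j in range(n):             # columns, with rows held fixed
            cols[j] = min(range(l), key=lambda q: total_cost(
                D, rows, cols[:j] + [q] + cols[j + 1:], k, l))
        new_cost = total_cost(D, rows, cols, k, l)
        if new_cost >= cost:
            break
        cost = new_cost
    return rows, cols, cost

# Toy 6x6 matrix with two dense diagonal blocks:
D = [[1 if (i < 3) == (j < 3) else 0 for j in range(6)] for i in range(6)]
print(inner(D, k=2, l=2))  # ([0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], 0.0)
```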

Algorithms
- OUTER: finds the number of row and column groups (a simplified sketch follows below).

  Start with k_0 = l_0 = 1.
  At each step t:
      Split the row group p* with the maximum per-row entropy, holding the columns fixed.
      Move each row in p* to a new group k_t + 1 iff doing so would decrease the per-row entropy of p*, producing a new grouping.
      Run INNER on the resulting grouping.
      If the total cost does not decrease, return; otherwise, increment t and repeat.
  Finally, perform the same procedure for the columns.
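And a correspondingly simplified outer search, reusing inner, total_cost, and mapping_codelength from the sketches above. Unlike the paper's OUTER, this version restarts INNER with one more row group instead of splitting the highest-entropy group, searches only the row dimension, and uses the Lemma's mapping codelength as a crude stand-in for the full model cost.

```python
def penalized_cost(D, rows, cols, k, l):
    """Data cost plus a crude model cost: the codelengths of the
    row and column group mappings (the Lemma, two slides back)."""
    row_sizes = [rows.count(p) for p in range(k)]
    col_sizes = [cols.count(q) for q in range(l)]
    return (total_cost(D, rows, cols, k, l)
            + mapping_codelength([s for s in row_sizes if s])
            + mapping_codelength([s for s in col_sizes if s]))

def outer_rows(D, l=1, max_groups=8):
    """Increase the number of row groups k while the penalized
    (data + model) codelength keeps decreasing, then stop."""
    k = 1
    rows, cols, _ = inner(D, k, l)
    best = penalized_cost(D, rows, cols, k, l)
    while k < max_groups:
        r2, c2, _ = inner(D, k + 1, l)
        cand = penalized_cost(D, r2, c2, k + 1, l)
        if cand >= best:
            break
        rows, cols, best, k = r2, c2, cand, k + 1
    return k, rows, cols, best

print(outer_rows(D, l=2)[0])  # settles on k = 2 for the toy matrix above
```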

Complexity
- INNER is linear with respect to the nonzero elements of D.
  - Let nnz denote the number of those elements, k the number of row groups, and l the number of column groups.
  - Row swaps are performed in the quadtree and take O(log m) time each, where m is the number of cells.
  - Let T be the number of iterations required to minimize the cost.
  - Total: O(nnz · (k + l + log m) · T)
- OUTER, though quadratic with respect to (k + l), is linear with respect to the dominating term nnz.
  - Let s be the number of row splits.
  - Total: O((k + l)² · nnz + (k + l) · s · log m)

Experiments: NoisyRegions
- Three features ("species") on a 32×32 grid.
  - So D has 32 × 32 = 1024 rows and 3 columns.
- 3% of the cells, chosen at random, are assigned a wrong species, also chosen at random.
- The spatial and non-spatial groupings are shown to the right.
  - Recall: bi-grouping is not spatial by default.
- Spatial grouping reduces the total codelength.
- The result is not quite perfect due to the heuristic nature of the algorithm.

Experiments: Birds
- 219 Finnish bird species over 3813 habitat cells of 10 km × 10 km each.
  - Species are the features; habitat cells are the rows, so our matrix is 3813 × 219.
- The spatial grouping is clearly more coherent.
- Spatial grouping reveals the boreal zones:
  - South Boreal: light blue and green.
  - Mid Boreal: yellow.
  - North Boreal: red.
- Outliers are (correctly) grouped alone:
  - Species with specialized habitats.
  - Or those reintroduced into the wild.

Other approaches
- Clustering:
  - k-means.
  - Variants using different estimates of central tendency: k-medoids, k-harmonic means, spherical k-means, …
  - Variants that determine k based on some criterion: X-means, G-means, …
  - BIRCH, CURE, DENCLUE.
  - LIMBO (also information-theoretic).
- These approaches are either lossy or parametric, or they are not easily adaptable to spatial data.

Room for improvement
- Complexity:
  - O(n · log m) total cost for reevaluating the quadtree codelength: O(log m) worst-case time per reevaluation/row swap, times n swaps.
  - However, the average-case complexity is probably much better; if we know something about the data distribution, we might be able to reduce this.
- Faster convergence:
  - Fewer iterations would reduce the scaling factor T.
  - Rather than stopping only when there is no decrease in cost, perhaps stop when the decrease falls below a threshold? (This introduces a parameter.)
- Accuracy:
  - The search will only find local optima, leading to errors.
  - We can employ approaches used in simulated annealing or genetic algorithms to attempt to find the global optimum: randomly restarting in the search space, for example.
  - Stochastic gradient descent is similar to what we are already doing, actually.

Conclusion
- Simultaneous and automatic grouping by spatial correlation and feature cohabitation.
  - Easy to exploit spatial locality.
  - Parameter-free.
- Utilizes MDL:
  - Minimizes the sum of the model cost and the data cost given the model.
- Efficient:
  - Almost linear in the number of nonzero entries of the matrix.

References
1. S. Papadimitriou, A. Gionis, P. Tsaparas, R. A. Väisänen, H. Mannila, and C. Faloutsos, "Parameter-Free Spatial Data Mining Using MDL", Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), Houston, TX, USA, November 27-30, 2005.
2. M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", Proceedings of the 5th International Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.

Thanks! Any questions?