Introduction Methods for Knowledge Discovery in Spatial Databases ◦ Generalization-Based Knowledge Discovery ◦ Methods Using Clustering ◦ Methods Exploring Spatial Associations ◦ Using Approximation and Aggregation ◦ Mining in Image Databases Future Directions Conclusion References
Our objectives: ◦ Describe existing spatial data mining methods ◦ Give a general perspective of the field’s current state ◦ Summarize the paper’s description of the state of spatial data mining in 1996.
A spatial database is a database that is optimized to store and query data that is related to objects in space, including points, lines and polygons. While typical databases can understand various numeric and character types of data, additional functionality needs to be added for databases to process spatial data types. These are typically called geometry or feature.
Database systems use indexes to quickly look up values and the way that most databases index data is not optimal for spatial queries. Instead, spatial databases use a spatial index to speed up database operations. In addition to typical SQL queries such as SELECT statements, spatial databases can perform a wide variety of spatial operations. A few examples are: Spatial Measurements Spatial Functions Spatial Predicates Constructor Functions Observer Functions
Data mining is a field at the intersection of computer science and statistics, it is the process that attempts to discover patterns in large data sets. It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
“Spatial data mining, or knowledge discovery in spatial database, refers to the extraction of implicit knowledge, spatial relations, or other patterns not explicitly stored in spatial databases.” (Koperski and Han, 1995) Data mining, or knowledge discovery in databases, refers to the “ discovery of interesting, implicit, and previously unknown knowledge from large databases.” (Frawley et al, 1992) WHAT’S THE DIFFERENCE?
DATA = an attribute of an object. SPATIAL DATA = Attribute data referenced to a specific location. The Attributes of spatial objects are: - Highly dependant on location - Often influenced by neighboring objects The (WHAT) dimension. (WHERE) & (WHAT)
ObjectWhat MilkIs cold YogurtIs warm ButterIs warm
ObjectWhereWhat MilkIn fridge (X 1,Y 1 ) Is cold MilkOn table (X 2,Y 2 ) Is warm ObjectWhereWhat MilkIn fridge (X 1,Y 1 ) Is warm YogurtIn fridge (X 1,Y 1 ) Is warm ButterIn fridge (X 1,Y 1 ) Is warm SPATIAL DATA MINING: THE FRIDGE IS BROKEN! Object() : Location() -> Characteristic() (Milk) : (In Fridge) -> [should be] -> (Cold)
To understand spatial data To discover relationships between spatial and non spatial data To capture the general characteristics in a concise way To build spatial knowledge-bases
Various kinds of rules can be discovered from databases in general Spatial Characteristic Rule: General description for spatial data. Example: A rule describing the general price range of houses in various geographic regions in a city.
Spatial Discriminant Rule: General description of the features discriminating or contrasting a class of spatial data from other classes. Example: Comparison of price ranges of houses in different geographical regions. Spatial Association Rule: Describes the implication of one or a set of features by another set of features in spatial databases. Example: Associating the price range of the houses with nearby spatial features, like beaches.
Thematic Maps: Presents the spatial distribution of attributes. Differs from general maps where the objective is to present the positions of objects. Used for discovering different rules 2 ways to represent a thematic map: ◦ Raster ◦ Vector
Image Databases: Special kind of spatial databases where data almost entirely consists of images or pictures. Used in remote sensing, medical imaging Usually stored in form of grid arrays representing the image intensity in spectral ranges.
Spatial Data Structures: Consists of points, lines, rectangles, etc. Spatial Computations: Spatial join is one of the most expensive spatial operations. Map overlay is an important operation for geographic information systems
SD mining algorithms must efficiently overcome: The huge volume of spatial data The complexity of spatial data types/structures The complexity of spatial accessing/query methods Expensive spatial processing operations m Object = HUGE S-DB GO! Where is: Citizen{Brad} Which highways cross Nat’n Park boundaries? = spatial JOIN
◦ Generalization-Based Knowledge Discovery ◦ Methods Using Clustering ◦ Methods Exploring Spatial Associations ◦ Using Approximation and Aggregation ◦ Mining in Image Databases
Spatial Data ◦ Geometric (location, area, perimeter) ◦ Topological (adjacency, inclusion) Non-Spatial Data ◦ Stored in a traditional database with an attribute that is a pointer to the spatial description (Aref et al )
Need for background knowledge in the form of concept hierarchies : spatial and non- spatial ◦ Non-Spatial: ◦ Spatial: Counties -> Provinces -> Larger Regions
Two generalization-baseds algorithms presented by Lu et al : ◦ Spatial Data Dominant Generalization: Generalization of the spatial objects continues until the "spatial generalization threshold" is reached. Non-spatial data is analyzed. ◦ Non-Spatial Data Dominant Generalization: The non-spatial data is generalized into higher concept level. Neighbouring areas with similar generalized attributes are merged. Dependent upon the concept hierarchies and need for algorithms that does not use these hierarchies
No need for a background information like concept hierarchies Foundation of clusters directly from the data A similar approach in machine learning is called "unsupervised learning" Clustering algorithms: ◦ PAM (Kaufmann and Rousseeuw – 1990) ◦ CLARAN (Kaufmann and Rousseeuw – 1990) ◦ CLARANS (Ng and Han ) SD (CLARANS) – Spatial Dominant Approach NSD (CLARANS) – Non-Spatial Dominant Approach
n objects and k clusters Selecting the most representative point for each cluster Most centrally located point in a cluster – medoid Computationally inefficient
29 Total Cost = K=2 Arbitrary choose k object as initial medoids Assign each remaining object to nearest medoids Randomly select a nonmedoid object,O ramdom Compute total cost of swapping Total Cost = 26 Swapping O and O ramdom If quality is improved. Do loop Until no change
Very similar to PAM Sampling is the difference The idea: If the sample is selected in a random manner, it will be representing the data correctly and, therefore, CLARA can deal with larger data sets with respect to PAM. The drawback: May not do the best clustering -> An object is a medoid and it is not selected when sampling
Tries to mix PAM and CLARA The idea: CLARA -> All stages with same sample CLARANS -> At each step a random sample The chance to miss the potential solutions is decreased to minimum Experimentally shown as more efficient than PAM and CLARA Also, detection of outliers i.e points that are not belong to any cluster
Based upon CLARANS, two spatial mining algorithms were developed: ◦ SD (CLARANS) – Spatial Dominant Approach ◦ NSD (CLARANS) – Non-Spatial Dominant Approach
Clustering of objects spatially by CLARANS Attribute-oriented induction on non-spatial description of the objects Result: Description of each cluster by it’s relative non- spatial attributes Spatial Cluster: Downtown Edmunton Non-Spatial (Attribute) Cluster: 50% Commercial, 40% Residental, 10% Public Services
Attribute-oriented generalization on non-spatial description of the objects For each generalized tuple, clustering of the objects by CLARANS In the case of an overlap between clusters, the merging of those clusters Spatial Cluster: Region East of 50 St, South of 35 St, North of Whitemud. Non-Spatial (Attribute) Cluster: Mostly Industrial
Two Drawbacks: ◦ The assumption that all objects to be clustered are stored in main memory Requirement of disk-based methods Integration with spatial access methods like R* tree ◦ Efficiency of the algorithm can be improved by some modifications Focusing on the representative objects when selecting medoids Restricting the access to certain objects that do not actually contribute to the computation
Presented because of the non-availability or time consuming construction of the R-Trees An algorithm presented by Zhang et al. – 1996 Can be used with any clustering algorithm like CLARANS Two concepts that are used in the algorithm: ◦ Clustering Feature A triple summarizing information about subclusters of points ◦ CF Tree A balanced tree that stores clustering features
(3,4) (2,6) (4,5) (4,7) (3,8) Clustering Feature: CF = (N, LS, SS) N: Number of data points LS: N i=1 =X i SS: N i=1 =X i 2 CF = (5, (16,30),(54,190))
Desire to discover rules that associate spatial objects with other spatial objects Association rules -> Agrawal et al. – 1993 Association rules for spatial databases -> Koperski and Han – 1995 Example Rule: is_a(x, school) -> close_to(x, park) (80%) Other predicates: intersects, overlap, disjoint, left_of, west_of, close_to, far_away
Support ( Support (X) ) Probability that a tuple contains X Confidence ( X->Y ) Probability that a tuple having X also contains Y Predefined thresholds (minimum support, minimum confidence) to determine associations Strong Rule: A Rule that has a support, no less than minimum support; and has a confidence, no less than minimum confidence
Methods that answer where the groups of data are More interesting: Why the clusters are there? Rephrased: What are the characteristics of the clusters in terms of the features that are close to them? C1 C2 C1 C2 Cluster C1 centered at X 1,Y 1 Cluster C2 centered at X 2,Y 2 45% of the objects in Clusters C1 & C2 are close to feature (River)
CRH algorithm presented by Knorr and Ng – 1995 CRH: C (encompassing circle), R (isothetic rectangle), H (convex hull) Using filters to reduce the candidate features from a large number of features Input: Any number of features and one cluster Output: List of relevant features and points located from a defined distance from features
There have been studies in this field, on the automatic recognition and categorization of objects Usual systems are composed of 3 components: Data focusing Feature extraction Classifaction learning
Data Focusing: Increases the overall efficiency of the system by first identifying the portion of the image being analyzed that is most likely to containt the target object Feature Extraction: Extracts interesting features from the data. Pattern recognition methods are used Classification Learning: Discriminates the target objects from other objects that look alike
This system has about 80% accuracy. It is difficult for experts to provide classifications with 100% certainty False classifications can produce large errors because they are treated as negative examples
Identified Future Directions for spatial data mining: ◦ Data Mining in Spatial Object Oriented DB ◦ Alternative Clustering Techniques: Clustering overlapping objects, Fuzzy Clustering of Spatial Data ◦ Mining under uncertainty: Evidential reasoning, Fuzzy sets approaches ◦ Spatial Data Deviation and Evolution Rules: Rule application to data that changes over time ◦ Interleaved Generalization (spatial and non-spatial) ◦ Generalization of Temporal Spatial Data (data evolution) ◦ Parallel Data Mining (multi-processor systems) ◦ Spatial Data Mining Query Language ◦ Multidimensional Rule Visualization and Multiple Thematic Maps “The variety of yet unexplored topics and problems makes knowledge discovery in spatial databases an attractive and challenging research field.”
TopicCurrently Active Research Field References DM in Spatial Obj-Oriented DBYes >10 (1997 – 2012) Alternative / Fuzzy Spatial Clustering Yes 1996, >10 (1997 – 2012) Mining under uncertaintyYes >10 (1997 – 2012) Deviation / Evolution RulesYes >10 (1997 – 2012) Interleaved GeneralizationVague… Generalization of Temporal Spatial Data Yes >10 (1997 – 2012) Parallel Data Miningmerged Spatial Data Mining Query Language Yes: GeoMiner, GMQL, Spatial SQL 1991, 1994, 1996, 1997 (h&k), >10 (1997 – 2012) Visualization TopicsYes – esp. GIS >10 (1997 – 2012)
Data Mining / Knowledge Discovery of Spatial Data is a large, active research area. While it was a “young” field at the time this survey paper was written, it is quickly maturing in applications such as: ◦ Geographic Information Systems ◦ Medical Imaging ◦ Robotics Navigation
deminsional-mbr.png object-model-examples.png projection.png kota/Map_Service_01.jpg kota/4D_Screen.jpg
Krzysztof Koperski, Junas Adhikary, Jiawei Han. Spatial Data Mining: Progress and Challenges Survey Paper. Workshop on Research Issues on Data Mining and Knowledge Discovery, 1996
W. G. Aref and H. Samet. Extending DBMS with Spatial Operations. In Proc. 2nd Symp. SSD’91, pp , Zurich, Switzerland, Aug W. Lu, J. Han, and B. C. Ooi. Discovery of General Knowledge in Large Spatial Databases. In Proc. Far East Workshop on Geographic Information Systems pp , Singapore, June L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In Proc Int. Conf. Very Large Data Bases, pp , Santiago, Chile, September T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an Efficient Data Clustering Method for Very Large Databases. In Proc ACM-SIGMOD Int. Conf. Management of Data, Montreal, Canada, June R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules Between Sets of Items in Large Databases. In Proc ACM-SIGMOD Int. Conf. Management of Data, pp , Washington, D.C., May K. Koperski and J. Han. Discovery of Spatial Association Rules in Geogrpahic Information Databases. In Proc. 4th Int’l Symp. On Large Spatial Databases (SSD’95), pp , Portland, Maine, August E. Knorr and R. T. Ng. Applying Computational Geometry Concepts to Discovering Spatial Aggregate Proximity Relationships. In Technical Report, University of British Columbia, Paper Review and Slides by Brad Danielson