DB group seminar, 2006/06/29
The University of Hong Kong, Dept. of Computer Science
Neighborhood based detection of anomalies in high dimensional spatio-temporal Sensor Datasets (SAC’04)
Nabil R. Adam, Vandana Pursnani Janeja, Vijayalakshmi Atluri
Presented by Leonidas Mak

Agenda
  Spatial data mining
  Problem & proposed solution
  Approach overview
  Implementation details
  Discussion of results

Spatial data mining
  Deals with knowledge discovery from spatial data sets that carry two kinds of attributes:
    Spatial (point, location, etc.)
    Non-spatial (population, speed, etc.)
  Two properties of spatial objects set spatial data mining apart from other data mining:
    Spatial dependency
    Spatial heterogeneity

Spatial data mining (cont.)
  When considering a spatial object, take into account:
    Spatial & non-spatial attributes
    Implicit and explicit spatial relationships
    Region of influence
    Underlying spatial process
  These influence the behavior of the object and of its neighboring objects

Spatial data mining (cont.)
  Consider the spatial process near the objects when performing spatial analysis
    To identify outliers & trends in the region of influence, look at:
      Spatial features in the vicinity of the objects
      The underlying spatial process
    To identify similarly behaving objects

Problem & proposed solution
  Spatial outlier detection
    Spatial outliers are objects that behave very differently from their neighborhood
  Graph based neighborhood [3][11]
    Does not capture the semantic relationship between the objects and the area of influence
  Some clustering techniques likewise
    Delaunay triangulation [5]
    Voronoi diagram

Problem & proposed solution (cont.)
  Refine the concept of “a neighborhood of an object”
    To characterize similarly behaving objects through:
      Spatial relationships
      Semantic relationships
  Identification of spatio-temporal outliers in high dimensions

Proposed approach to solution
  Take into account both spatial and semantic relationships
    The features of objects can differ despite their close proximity
  Each object has an immediate neighborhood
    Micro Neighborhood (M_i)
  An M_i can be extended or merged with others
    Macro Neighborhood (MaN)

Some definitions
  Outlier (in terms of distance) [7]
    An object o in a dataset T is a DB(p,D) outlier if at least a fraction p of the objects in T lie at a distance greater than D from o
  Voronoi diagram (of a set of n objects O) [10]
    The subdivision of the plane into n polygons, with a point q in the polygon corresponding to object o_i iff $\mathrm{dist}(q, o_i) < \mathrm{dist}(q, o_j)$ for every $o_j \in O$, $j \neq i$
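
The DB(p,D) definition can be checked directly by counting. The following is a minimal brute-force sketch, not the algorithms of [7]; the function name `is_db_outlier`, the sample points and the values of p and D are illustrative assumptions.

```python
import math

def is_db_outlier(o, dataset, p, D):
    """Return True if at least a fraction p of the objects in `dataset`
    lie at a distance greater than D from object `o` (brute force)."""
    far = sum(1 for x in dataset if math.dist(o, x) > D)
    return far / len(dataset) >= p

# Hypothetical 2-D objects: four close together, one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

# With p = 0.75 and D = 1.0 only the distant point qualifies:
# 4 of the 5 objects in T are farther than D from it.
outliers = [pt for pt in points if is_db_outlier(pt, points, p=0.75, D=1.0)]
```

The fraction is taken over all of T, including o itself, which follows the wording of the definition literally; excluding o from the count is an equally plausible reading.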

Some definitions (cont.)
  Jaccard Coefficient (JC) [10]
    Measures the similarity of asymmetric binary variables
    Quantifies the positive (1-1) matches, indicating the similarity of features
  Contingency table for the binary features of two objects:
                   Object j
                   1    0
    Object i   1   a    b
               0   c    d
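
As a small illustration, the usual Jaccard coefficient for asymmetric binary variables is a / (a + b + c), where a counts 1-1 matches, b counts 1-0 mismatches and c counts 0-1 mismatches; 0-0 matches (d) are ignored. The sketch below assumes that standard form, and the function name and feature vectors are made up for the example.

```python
def jaccard_coefficient(u, v):
    """Jaccard coefficient of two equal-length 0/1 feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)  # 1-1 matches
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)  # present only in u
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)  # present only in v
    if a + b + c == 0:
        return 0.0  # assumed convention when neither vector has any feature
    return a / (a + b + c)

# Hypothetical feature vectors of two micro neighborhoods.
m1 = [1, 1, 0, 1, 0]
m2 = [1, 0, 0, 1, 1]
jc = jaccard_coefficient(m1, m2)  # a = 2, b = 1, c = 1, so JC = 0.5
```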

Some definitions (cont.)
  Silhouette Coefficient (SC) [10]
    Identifies the quality of a clustering result in terms of its structure and its overlap with other clusters [6]
      0.7 < SC <= 1.0: strong structure
      0.5 < SC <= 0.7: medium structure
      SC < 0.25: no structure
    Used here to indicate the similarity of two spatial micro neighborhoods
    Built from the silhouette of each data point i and the Silhouette Coefficient of a cluster X
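
A rough sketch of how the SC could be computed for a pair of micro neighborhoods. It assumes the standard silhouette s(i) = (b - a) / max(a, b), with a the mean distance from point i to the other points of its own cluster and b the mean distance to the points of the other cluster, and it takes the SC of cluster X to be the mean of s(i) over X. The function name and the readings are hypothetical.

```python
import math

def silhouette_coefficient(cluster_x, cluster_y):
    """Mean silhouette of the points in cluster_x, with cluster_y as the
    only other cluster. Points are tuples of equal dimension."""
    scores = []
    for idx, point in enumerate(cluster_x):
        own = [q for k, q in enumerate(cluster_x) if k != idx]
        if not own:
            scores.append(0.0)  # assumed convention for a singleton cluster
            continue
        a = sum(math.dist(point, q) for q in own) / len(own)
        b = sum(math.dist(point, q) for q in cluster_y) / len(cluster_y)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical 1-D sensor readings collected over a period of time.
x = [(1.0,), (1.2,), (0.8,)]
far_apart = [(9.0,), (9.3,), (8.7,)]     # clearly separated from x
overlapping = [(1.1,), (0.9,), (1.0,)]   # same value range as x

sc_separated = silhouette_coefficient(x, far_apart)
sc_overlapping = silhouette_coefficient(x, overlapping)
```

Well separated readings give an SC close to 1 (strong structure), while overlapping readings give a low SC, which is the case the slides treat as semantically similar.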

Overview of the approach
  1. Generation of Micro Neighborhood
    Generates the Voronoi polygons
    Input: set of objects with spatial locations
    Output: Voronoi diagram
  2. Identification of Spatial Relationships
    Input: Voronoi diagram, edge list
    Output: adjacency matrix indicating whether each M_i is a neighbor of any other M_i

Overview of the approach (cont.)
  3. Identification of Semantic Relationships
    Calculates JC and SC
    Input of the JC part: a set of micro neighborhoods
      Each characterized by a feature vector
      Representing the spatial processes
    Input of the SC part: a set of micro neighborhoods
      Each characterized by a set of points
      Readings over a period of time

Overview of the approach (cont.)
  4. Generation of Macro Neighborhood
    Input: neighborhood (adjacency) matrix, JC, SC
    Output: Macro Neighborhood
  5. Detecting outliers
    Based on the distance values of various points
    Uses distance based outlier detection [7]

Overview of the approach (cont.)
  [Figure: pipeline of the approach. Object set -> Generation of Micro Neighborhood -> Voronoi diagram, edge list -> Identification of Spatial Relationships -> neighborhood matrix. Feature vector, set of points -> Identification of Semantic Relationships -> JC, SC. Neighborhood matrix, JC, SC -> Generation of Macro Neighborhood -> Macro neighborhood -> Outliers Detection]

Generation of Micro Neighborhood
  The definition of neighborhood is based on the concept of Voronoi diagrams
  A Voronoi polygon is generated around each spatial object
  A feature q that lies in a Voronoi polygon is associated with that polygon's object
  The region of influence is defined as the Voronoi polygon
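
Because a point belongs to the Voronoi polygon of its nearest object, a feature can be assigned to a micro neighborhood without constructing the polygons. The sketch below uses that shortcut; it is not how the slides generate the diagram, and the function name, sensor ids and coordinates are illustrative assumptions.

```python
import math

def micro_neighborhood_of(q, objects):
    """Return the id of the object whose Voronoi polygon contains point q,
    i.e. the object nearest to q (ties resolved by dictionary order)."""
    return min(objects, key=lambda obj_id: math.dist(q, objects[obj_id]))

# Hypothetical sensor locations (x, y) and one point on a spatial feature.
sensors = {"s1": (0.0, 0.0), "s2": (10.0, 0.0), "s3": (0.0, 10.0)}
river_point = (2.0, 1.0)

owner = micro_neighborhood_of(river_point, sensors)  # nearest sensor is "s1"
```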

Generation of Micro Neighborhood (cont.)
  A Micro Neighborhood (M_i) is defined as:
    The region of influence, i.e. the dominance of one object over the others
  Spatial features have their own spatial process
  [Figure: a sensor (object), a river (spatial feature) and the resulting Micro Neighborhood]

Identification of Spatial Relationships
  Spatial relationships are binary relations between pairs of objects
    Object: point, line, polygon, etc.
    Relationship: topological, distance, etc.
  Topological relationship of adjacency
    Determined by the shared edge of two Voronoi polygons
    The edge list is generated by Triangle, a 2D mesh generator [12], from the Delaunay triangulation

Identification of Spatial Relationships (cont.)
  Edge list format: Edge#
  Two micro neighborhoods are adjacent if there is an edge between their two spatial objects
  The adjacency information is stored in the neighborhood adjacency matrix
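
Turning an edge list into the neighborhood adjacency matrix is a single pass over the edges. In this sketch each edge is assumed to be a tuple (edge number, first object index, second object index) with 0-based object indices; that layout is an assumption for illustration, not the exact file format produced by Triangle.

```python
def adjacency_matrix(n, edges):
    """Build a symmetric n x n 0/1 matrix from (edge_id, u, v) tuples."""
    adj = [[0] * n for _ in range(n)]
    for _edge_id, u, v in edges:
        adj[u][v] = 1
        adj[v][u] = 1  # adjacency of micro neighborhoods is symmetric
    return adj

# Hypothetical edge list for four spatial objects.
edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
adj = adjacency_matrix(4, edges)
```

Here object 3 is adjacent only to object 2, so row 3 of `adj` is `[0, 0, 1, 0]`.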

Identification of Semantic Relationships
  A Micro Neighborhood can be characterized by
    The presence or absence of spatial features
    Other spatial processes
    This results in a feature vector of 0's and 1's [14]
  The object itself may also have an associated set of readings (points in the neighborhood)
  Both the features and the data points in the neighborhood are used

Identification of Semantic Relationships (cont.)
  JC is used for the binary-valued attributes in the feature vector
  SC is used for non-binary-valued attributes, such as sensor readings
    Measures the overlap of the micro neighborhoods
    Based on the readings over a period of time
  Two micro neighborhoods are considered semantically similar when they have
    A higher JC
    A lower SC

Generation of Macro Neighborhood
  Each M_i can be considered an implicit sub-cluster or grouping
  A Macro Neighborhood can be defined in terms of
    The spatial relationships between the M_i
    The semantic relationships
    Spatial and non-spatial attributes
  A Macro Neighborhood is defined as a graph
    With outer edges E' from the M_i
    In which a link l = (m_i, m_{i+1}) holds iff m_i and m_{i+1} are both spatial and semantic neighbors

Generation of Macro Neighborhood (cont.)
  Spatial neighbor: (m_i, m_{i+1}) are related by a spatial relation between their polygons
  Semantic neighbor: (m_i, m_{i+1}) are related by a semantic relation based on JC & SC, such that the pair satisfies the chosen JC and SC thresholds
  Merge M_i & M_j to form the MaN
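
One way to sketch the merge step is a union-find pass over all pairs of micro neighborhoods, joining a pair when it is spatially adjacent and semantically similar. The similarity test used here (JC at or above `jc_min`, SC at or below `sc_max`) and the transitive merging are assumptions made for illustration; the matrices and thresholds are hypothetical.

```python
def macro_neighborhoods(adj, jc, sc, jc_min, sc_max):
    """Group micro neighborhoods 0..n-1 into macro neighborhoods.

    adj, jc and sc are n x n matrices: adjacency (0/1), pairwise Jaccard
    coefficient and pairwise Silhouette Coefficient."""
    n = len(adj)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            spatial = adj[i][j] == 1
            semantic = jc[i][j] >= jc_min and sc[i][j] <= sc_max
            if spatial and semantic:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical chain of four micro neighborhoods: 0-1, 1-2, 2-3 are adjacent.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
jc = [[1.0, 0.8, 0.0, 0.0], [0.8, 1.0, 0.3, 0.0],
      [0.0, 0.3, 1.0, 0.6], [0.0, 0.0, 0.6, 1.0]]
sc = [[0.1] * 4 for _ in range(4)]

strict = macro_neighborhoods(adj, jc, sc, jc_min=0.5, sc_max=0.25)
relaxed = macro_neighborhoods(adj, jc, sc, jc_min=0.2, sc_max=0.25)
```

With the stricter JC threshold the chain splits into two groups; relaxing it to 0.2 also joins the pair (1, 2), so all four micro neighborhoods end up in one macro neighborhood.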

Outlier detection
  Graph based spatial outlier detection [11]
  It is important to identify the neighborhood as well as the outliers
    Since a given point can be an outlier of several clusters
  A spatio-temporal outlier is defined as:
    A point x_i is a spatio-temporal outlier iff it differs sufficiently from the other points in the Macro Neighborhood

Outlier detection (cont.)
  First identify the Macro Neighborhood
  Use the distance based outlier detection technique [7]
    Proximity, in terms of a distance threshold, is one of the determining factors
  Investigate whether the object is an outlier (spatial outlier)
    It is one if more than a certain number of its points are outliers
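
The effect of restricting the distance based test to a Macro Neighborhood can be seen on a toy example. The sketch below applies a brute-force DB(p,D) test to scalar readings, once per neighborhood and once on the pooled data; the neighborhood names, readings and parameter values are made up, and the test is a simplification rather than the procedure of [7].

```python
def db_outliers(values, p, D):
    """Readings for which at least a fraction p of `values` differ by more than D."""
    result = []
    for o in values:
        far = sum(1 for x in values if abs(x - o) > D)
        if far / len(values) >= p:
            result.append(o)
    return result

def outliers_by_macro_neighborhood(readings, p, D):
    """Run the test separately inside each macro neighborhood."""
    return {name: db_outliers(vals, p, D) for name, vals in readings.items()}

# Hypothetical readings grouped by macro neighborhood.
readings = {
    "MaN_A": [10.0, 10.5, 9.8, 10.2, 30.0],
    "MaN_B": [30.0, 29.5, 30.4, 31.0, 30.2],
}

local = outliers_by_macro_neighborhood(readings, p=0.75, D=5.0)
pooled = db_outliers(readings["MaN_A"] + readings["MaN_B"], p=0.75, D=5.0)
```

The reading 30.0 is an outlier inside `MaN_A`, where the other readings are near 10, but it goes unnoticed when all readings are pooled, because it matches the values that are normal in `MaN_B`.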

Dataset
  Data sets
    Highway traffic monitoring [11]
    Water monitoring [14]
  Highway traffic monitoring
    Traffic readings from 60 stations in time slots of 5 minutes
    Non-spatial attributes: volume, occupancy
    Spatial attributes: latitude, longitude
    Feature matrix: traffic flow direction, clustering

Dataset (cont.)
  Water monitoring
    7 stations monitoring the water quality of rivers
    The feature matrix consists of 21 features
      Used to show the characteristics in the M_i
    Spatial attributes: latitude, longitude
    Temporal attributes: date, time of sampling
    Data points consist of more than 100 attributes

Results (Spatial)
  Spatial relationships are identified by applying the program TRIANGLE [12]
    It generates edges for nodes that are judged adjacent to each other
    Adjacency is expressed as a matrix
  With high connectivity, everything collapses into one big neighborhood

Results (Spatial + JC)
  Incremental building of the Macro Neighborhood
    JC = 0.5: MaN consists of polygons 2, 4, 6, 7
    JC = 0.2: MaN consists of polygons 2, 3, 4, 6, 7
  Merging proceeds incrementally as the JC threshold becomes less restrictive

Results (Spatial + JC) (cont.)
  Refinement in the outliers detected
  The number of outliers detected varies as JC changes
  [Figure: Water monitoring data, number of outliers vs. JC. Axes: JC threshold, number of outliers]

Results (Spatial + JC) (cont.)
  Systematic elimination of outliers
  Consistency in outlier detection
    If a neighborhood has no outliers at a low JC threshold, it consistently has none at higher threshold values
  O1: outliers detected at a high JC threshold
  O2: outliers detected at a low JC threshold
    JC threshold    Outliers (part of)
    0.8             2, 4
    0.5             2, 3, 4, 8

Results (Spatial + SC)
  Adding SC leads to a similar conclusion
  As SC decreases, the neighborhood becomes more refined
  [Figure: Water monitoring data, number of outliers vs. SC. Axis labels as printed on the slide: JC THRESHOLD, NUM. OUTLIERS]

Results (Spatial + JC + SC)
  Low JC & high SC: big neighborhood, more outliers
  High JC & low SC: refined neighborhood, reduced outliers

References
  [1] F. Aurenhammer. Voronoi Diagrams: A Survey of a Fundamental Geometric Data Structure. ACM Computing Surveys, Vol 23(3), 1991.
  [2] M. Ester, A. Frommelt, H.-P. Kriegel, and J. Sander. Algorithms for characterization and trend detection in spatial databases. In Proceedings of 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD).
  [3] M. Ester, H.-P. Kriegel, and J. Sander. Spatial Data Mining: A Database Approach. In Proceedings of the International Symposium on Large Spatial Databases, Berlin, Germany, July 1997.
  [4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. In Proceedings of 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD).
  [5] I. Kang, T. Kim, and K. Li. A Spatial Data Mining Method by Delaunay Triangulation. In Proceedings of the 5th International Workshop on Advances in Geographic Information Systems (GIS-97), pages 35-39, 1997.
  [6] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons.
  [7] E. M. Knorr and R. T. Ng. Algorithms for Mining Distance-Based Outliers in Large Datasets. In Proceedings of 24th Int. Conf. Very Large Data Bases, VLDB, 1998.
  [8] H. J. Miller and J. Han. Geographic Data Mining & Knowledge Discovery. Taylor & Francis, 1st edition.
  [9] Minnesota Highway traffic dataset.
  [10] A. Okabe, B. Boots, K. Sugihara, S. Chiu. Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. John Wiley, 2000.
  [11] S. Shekhar, C. Lu, and P. Zhang. Detecting Graph-Based Spatial Outlier: Algorithms and Applications (A Summary of Results). Computer Science & Engineering Department, UMN, Technical Report.
  [12] J. R. Shewchuk. Triangle: Engineering a 2D Quality Mesh Generator and Delaunay Triangulator. First Workshop on Applied Computational Geometry (Philadelphia, Pennsylvania), ACM, May 1996.
  [13] D. Unwin. Introductory Spatial Analysis. Routledge Kegan & Paul, January 1982.
  [14] USGS, National Stream Water Quality Network (NASQAN), Published Data.
  [15] Water Monitoring, the Meadowlands Environmental Research Institute, and the New Jersey Meadowlands Commission.