Spatial Temporal Data Mining

Slides:

Advertisements

Similar presentations

1 DATA STRUCTURES USED IN SPATIAL DATA MINING. 2 What is Spatial data ? broadly be defined as data which covers multidimensional points, lines, rectangles,

Advertisements

Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.

Discovering Lag Interval For Temporal Dependencies Larisa Shwartz Liang Tang, Tao Li, Larisa Shwartz1 Liang Tang, Tao Li

Christoph F. Eick Questions and Topics Review Nov. 22, Assume you have to do feature selection for a classification task. What are the characteristics.

STUN: SPATIO-TEMPORAL UNCERTAIN (SOCIAL) NETWORKS Chanhyun Kang Computer Science Dept. University of Maryland, USA Andrea Pugliese.

B+-Trees (PART 1) What is a B+ tree? Why B+ trees? Searching a B+ tree

Chapter 7 – Classification and Regression Trees

Chapter 7 – Classification and Regression Trees

CONNECTIVITY “The connectivity of a network may be defined as the degree of completeness of the links between nodes” (Robinson and Bamford, 1978).

SASH Spatial Approximation Sample Hierarchy

1 Complexity of Network Synchronization Raeda Naamnieh.

Evolutionary Computational Intelligence Lecture 10a: Surrogate Assisted Ferrante Neri University of Jyväskylä.

Accessing Spatial Data

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,

Spatio-Temporal Databases

Evaluating Hypotheses

An Approach to Active Spatial Data Mining Wei Wang Data Mining Lab, UCLA March 24, 1999.

Core Text Mining Operations 2007 년 02 월 06 일 부산대학교 인공지능연구실 한기덕 Text : The Text Mining Handbook pp.19~41.

Chapter 3: Data Storage and Access Methods

Techniques and Data Structures for Efficient Multimedia Similarity Search.

Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning 1 Evaluating Hypotheses.

Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.

Clustering Ram Akella Lecture 6 February 23, & 280I University of California Berkeley Silicon Valley Center/SC.

CS4432: Database Systems II

Query Processing Presented by Aung S. Win.

Fast Subsequence Matching in Time-Series Databases Christos Faloutsos M. Ranganathan Yannis Manolopoulos Department of Computer Science and ISR University.

Detecting Distance-Based Outliers in Streams of Data Fabrizio Angiulli and Fabio Fassetti DEIS, Universit `a della Calabria CIKM 07.

Data Mining Techniques

Describing distributions with numbers

Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.

A Simple Method to Extract Fuzzy Rules by Measure of Fuzziness Jieh-Ren Chang Nai-Jian Wang.

Introduction to variable selection I Qi Yu. 2 Problems due to poor variable selection: Input dimension is too large; the curse of dimensionality problem.

Quantifying the dynamics of Binary Search Trees under combined insertions and deletions BACKGROUND The complexity of many operations on Binary Search Trees.

Introduction to Data Mining Group Members: Karim C. El-Khazen Pascal Suria Lin Gui Philsou Lee Xiaoting Niu.

IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture8.

Why Is It There? Getting Started with Geographic Information Systems Chapter 6.

The BIRCH Algorithm Davitkov Miroslav, 2011/3116

The X-Tree An Index Structure for High Dimensional Data Stefan Berchtold, Daniel A Keim, Hans Peter Kriegel Institute of Computer Science Munich, Germany.

Chapter 9 – Classification and Regression Trees

Module 5 Planning for SQL Server® 2008 R2 Indexing.

TAR: Temporal Association Rules on Evolving Numerical Attributes Wei Wang, Jiong Yang, and Richard Muntz Speaker: Sarah Chan CSIS DB Seminar May 7, 2003.

Disclosure risk when responding to queries with deterministic guarantees Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.

Leonardo Guerreiro Azevedo Geraldo Zimbrão Jano Moreira de Souza Approximate Query Processing in Spatial Databases Using Raster Signatures Federal University.

Image Modeling & Segmentation Aly Farag and Asem Ali Lecture #2.

Spatial Query Processing Spatial DBs do not have a set of operators that are considered to be basic elements in a query evaluation. Spatial DBs handle.

Bin Yao (Slides made available by Feifei Li) R-tree: Indexing Structure for Data in Multi- dimensional Space.

August 30, 2004STDBM 2004 at Toronto Extracting Mobility Statistics from Indexed Spatio-Temporal Datasets Yoshiharu Ishikawa Yuichi Tsukamoto Hiroyuki.

Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.

Spatial-Temporal Data Mining Wei Wang Data Mining Lab Computer Science Department UCLA.

Clustering Instructor: Max Welling ICS 178 Machine Learning & Data Mining.

Spatial Indexing Techniques Introduction to Spatial Computing CSE 5ISC Some slides adapted from Spatial Databases: A Tour by Shashi Shekhar Prentice Hall.

Data Mining and Decision Support

1 CSIS 7101: CSIS 7101: Spatial Data (Part 1) The R*-tree ： An Efficient and Robust Access Method for Points and Rectangles Rollo Chan Chu Chung Man Mak.

Dynamics of Binary Search Trees under batch insertions and deletions with duplicates ╛ BACKGROUND The complexity of many operations on Binary Search Trees.

A Spatial Index Structure for High Dimensional Point Data Wei Wang, Jiong Yang, and Richard Muntz Data Mining Lab Department of Computer Science University.

CLUSTERING GRID-BASED METHODS Elsayed Hemayed Data Mining Course.

QED : An Efficient Framework for Temporal Region Query Processing Yi-Hong Chu 朱怡虹 Network Database Laboratory Dept. of Electrical Engineering National.

Spatial Data Management

Data Mining Soongsil University

Ch9: Decision Trees 9.1 Introduction A decision tree:

Extra: B+ Trees CS1: Java Programming Colorado State University

Algorithm Analysis CSE 2011 Winter September 2018.

CS 685: Special Topics in Data Mining Jinze Liu

B+-Trees and Static Hashing

Graphs Chapter 11 Objectives Upon completion you will be able to:

Database Design and Programming

Birch presented by : Bahare hajihashemi Atefeh Rahimi

Continuous Density Queries for Moving Objects

Tree-Structured Indexes

CS 685: Special Topics in Data Mining Jinze Liu

Presentation transcript:

Spatial Temporal Data Mining Wei Wang Data Mining Lab, UCLA January 21, 1999

Outline Introduction Statistical Clustering User-defined Trigger Spatial Index Structure for High Dimensional Point Data Temporal Spatial Pattern Detection ongoing research

Introduction Spatial data mining has been an active research area during recent years. For some well know problem, e.g., clustering, many existing algorithms are not efficient enough. There is still room for improvement. There are a lot of interesting problems remaining uninvestigated. We classify a subset of problems and try to solve them efficiently.

Outline STING: a statistical information grid approach to spatial data mining STING+: an approach to active spatial data mining PK-tree: a spatial index structure for high dimensional point data Temporal spatial pattern detection

STING Spatial database is usually huge. Efficiency of the data mining algorithm is crucial. Example: each person is an object Query: Find high income area within California high income: salary > $50,000 area > 4 square miles Traditional Method Step 1: Select out all person whose salary are high. Step 2: Do clustering analysis on those persons selected out. Step 3: Form the region that each cluster occupies. Step 4: Return those regions larger than 4 square miles. If “high income” is defined as: 80% persons have salary > $50,000 then the previous method can not even answer the query. STING was proposed to solve such problem efficiently.

STING Region Query Example: Select the maximal regions that have at least 100 houses per unit area and at least 70% of the house prices are above $400K and with total area at least 100 units with 90% confidence. SELECT REGION FROM house-map WHERE DENSITY IN (100, ) AND price RANGE (400000, ) WITH PERCENT (0.7, 1) AND AREA (100, ) AND WITH CONFIDENCE 0.9

STING Objects are represented by points, each of which has associated spatial attributes, its location, and non-spatial (numerical) attributes. Space is recursively divided into smaller rectangular cells until certain level is reached. A hierarchical structure is employed. The average number of objects in a leaf cell is in the range from several dozens to several thousands. Preprocess data capture the statistical information

STING 1st layer (root) (i-1)th layer ith layer (i+1)th layer (leaf layer)

STING For each cell, we have attribute-independent parameter n: number of objects attribute-dependent parameters (for each numerical attribute) mean: mean value of the attribute std: standard deviation of the attribute value min: the minimum value of the attribute max: the maximum value of the attribute distribution: the type of distribution that best fits the attribute value (can be NONE) Bottom-up generation when the data is loaded into the database. Linear compilation time Only has to be done once  not for each query.

STING Take advantage of the statistical information captured. Only go through relevant cells at each level. Root is relevant. For each relevant cell, we exam its children at next level by statistical test and label them as relevant or not relevant. Form regions from relevant leaf level cells. Do not need to access full database. It is very efficient.

STING The computational complexity of STING is linearly proportional to the number of leaf cells. We used the SEQUOIA 2000 benchmark as the data set to compare the performance of STING with other approaches. DBSCAN STING

STING STING is a query-independent approach. The statistical information exists independently of queries. STING has a much smaller response time compared to other approaches The computational complexity is linearly proportional to the number of leaves. I/O cost is low. STING can support different resolution of query result. Regions returned by STING approach that returned by DBSCAN when the granularity approaches zero. Parameters in the hierarchical structure can be maintained efficiently by incremental update.

Outline STING: a statistical information grid approach to spatial data mining STING+: an approach to active spatial data mining PK-tree: a spatial index structure for high dimensional point data Temporal spatial pattern detection

STING+ Moreover, since objects evolve, interesting patterns may emerge or disappear over time. Example: Trigger: Do bandwidth reallocation when the average call length is greater than 10 minutes within the region where at least 10 cellular phones are in use per squared mile. This can not be supported by traditional database triggers efficiently due to the fact that the class membership of an object is not only determined by its non-spatial attributes but also by the attributes of objects in its neighborhood. Naïve approach: re-evaluate condition periodically. Not efficient.

STING+ STING+ was an extension of STING to support user-defined trigger. In spatial databases, object insertion, deletion, and update are primitive events. Observation: Usually, only the cumulative effect of a set of primitive events may cause the trigger condition to be true. We refer such set of primitive events to as a composite events. . +

STING+ Condition-Action paradigm In general, it is difficult or even impossible for user to specify all possible composite events that may cause the trigger condition to be true. In general, evaluating a user-defined trigger T usually involves two aspects: Find a set of composite events E(s) that may cause the trigger condition CT to become true. Each time some composite event in E(s) occurs, check the status (false or true) of CT (given that CT was false previously).

STING+ Observation: As a side effect of the occurrence of some composite event, the set of composite events E(s) that could cause CT to transition from false to true might also evolve over time. Two set of composite events we need to consider: the set of composite events E(s) that can cause CT to become true need to re-evaluate CT the set of composite events F(s) that can cause a change to E(s) need to update E(s) . + . +

STING+ Observation: In spatial databases, the effect of an event is usually local to its neighborhood. STING+ decomposes the user-defined trigger into a set of sub-triggers associated with individual cells in the hierarchical structure. These sub-triggers are used to monitor composite events in E(s) and F(s) and change accordingly when E(s) and F(s) evolves. . x o1 o2 . . Level 4 Level 3

STING+ . . Updates are suspended at some level in the hierarchy until such time that the cumulative effect of these updates might cause the trigger condition to become satisfied. Level 2 Level 1 . + Level 1 . + Level 1 . + . + Level 2 Level 3

STING+ Example: Trigger bandwidth reallocation when the total area occupied by those regions in California where at least 10 cellular phones are in use per squared mile and the average length of phone calls is at least 15 minutes with total area at least 50 squared miles increases by at least 10 squared miles. DEFINE TRIGGER example ON cellular-phone WHEN SELECT SIZE(REGION) INCREASE RANGE (10, ) WHERE DENSITY IN RANGE (10, ) AND AVERAGE(length) IN RANGE (15, ) AND AREA IN RANGE (50, ) LOCATION California DO bandwidth-reallocation

STING+ Observation: Trigger condition CT is a conjunction of predicates P1  P2  …  Pn and can not be true if one predicate is false. They can be evaluated in a certain order: the ith predicate is tested when all previous i -1 predicates are true. The evaluation order should be chosen in such a way that the total cost is minimum. STING+ evaluates CT in the order {location, density condition, attribute condition}, each of which is evaluated in a different phase. Location only needs to be evaluated once and the cost can be regarded as constant in the trigger evaluation process. If the location is fixed, unnecessary sub-triggers set on cells outside the location can be avoided and hence save the evaluation cost of other predicates. Sub-triggers set during an earlier phase will exist longer than those set in a later phase. It is better to first evaluate the predicate that takes less time to handle. cost(density) < cost(attribute)

STING+ Average CPU cycles for handling each type of sub-trigger

STING+

Outline STING: a statistical information grid approach to spatial data mining STING+: an approach to active spatial data mining PK-tree: a spatial index structure for high dimensional point data Temporal spatial pattern detection

PK-tree As both the number of objects and the number of attributes are very large, it is essential to organize the set of objects by some dynamic indexing structure. Point index methods

PK-tree . Level 0 . Level 1 . Level 2 . Level 3 Spatial decomposition: Space is recursively divided until a level LD such that each cell contains at most one point.

PK-tree . . . Example of a PR Quad-tree: b4 c3 e1 e2 f3 g2 h2 g3 a7 g5 f6 f5 e5 d8 c8 d7 b8 b7 . Level 3 (LD) 1 2 3 4 5 6 7 8 a b c d e f g h 16 intermediate nodes, height = 3 1 2 3 4 5 6 7 8 . Level 2 a b c d e f g h K M N A B C D E F G H L 1 2 3 4 5 6 7 8 . Level 1 a b c d e f g h Q R T S root

Example of a PK-tree of rank 3: d1 b4 d2 c3 e1 e2 f3 g2 h2 g3 a7 g5 f6 f5 e5 d8 c8 d7 b8 b7 . Level 3 (LD) 1 2 3 4 5 6 7 8 a b c d e f g h 5 intermediate nodes, height = 2 1 2 3 4 5 6 7 8 . Level 1 a b c d e f g h Q R K N M M N K 1 2 3 4 5 6 7 8 . Level 2 a b c d e f g h root

PK-tree PK-tree employs a concept of K-instantiable cell to eliminate unnecessary nodes. Point cell: a non-empty cell at level LD A cell C is K-instantiable iff C is a point cell, or there does not exist (K-1) or less K-instantiable sub-cells to cover all non-empty space in C Only K-instantiable cells serve as nodes in the PK-tree (expect the root). The parent-child relationship follows naturally from the cell-subcell relationship.

PK-tree Properties: Bounds on node’s outdegree allows allocating one node to a page Bounded storage space Existence and Uniqueness enables us to analyze the behavior of a PK-tree easier. Expected height log(N) under some general condition guarantees efficiency of retrieval and update. No overlapping among sibling nodes efficient retrieval Empirical studies shown that the PK-tree outperforms SR-tree and X-tree by a wide margin.

PK-tree Height of generated trees on 100,000 points Size of index in MB

PK-tree KNN query on clustered data distribution

PK-tree Real data set: NASA Sky Telescope Data 200,000 two-dimensional points (they are the coordinates of crater locations on the surface of Mars)

Outline STING: a statistical information grid approach to spatial data mining STING+: an approach to active spatial data mining PK-tree: a spatial index structure for high dimensional point data Temporal spatial pattern detection

Temporal Spatial Pattern Detection When the number of attributes is large and/or the value of attributes evolve frequently, the complexity of patterns and the number of potential patterns increase. It is not desirable or even feasible to ask the user specify interesting patterns. E.g., the user wants to know any possible patterns involving certain attributes such as salary, rent, cellular phone usage, etc. Existing association rule algorithm can not be applied. Continuous attribute domain Temporal evolution Prior knowledge about relationships among attributes and objects

Temporal Spatial Pattern Detection Object  represented by point primitive attributes spatial attributes, i.e., coordinates of its position non-spatial attributes, e.g., name, weight, height, salary, rent derived attributes  derived from primitive attribute(s) environment attributes, e.g., distance to a hospital, average income in the neighborhood area Consider a sequence of snapshots S1, S2, …, Sn Temporal Spatial Pattern describes a possible relationship among evolution of attributes E.g., if the user want to know patterns involving salary and distance to big city, then one interesting pattern would be “people receiving a raise tends to move further away from the big city from 1987 to 1993.”.

Temporal Spatial Pattern Detection More complicated patterns Patterns on clustering evolution Patterns of high order Patterns whose cause and consequence do not happen together There is a delay for the consequence to show up. Patterns involving relationships among objects e.g., people who live far away from any doctor tend to move to a place closer to some doctor. Environment variables evolve independently over time.