What have we covered? Indexing and Hashing, Data Warehouse and OLAP


What have we covered?
- Indexing and Hashing
- Data Warehouse and OLAP
- Data Mining
- Information Retrieval and Web Mining
- XML and XQuery
- Spatial Databases
- Transaction Management

Lecture 6: Spatial Data Management

Types of Spatial Data
- Point Data
  - Points in a multidimensional space
  - E.g., raster data such as satellite imagery, where each pixel stores a measured value
  - E.g., feature vectors extracted from text
- Region Data
  - Objects have spatial extent with location and boundary
  - The DB typically uses geometric approximations constructed from line segments, polygons, etc., called vector data.

Applications of Spatial Data
- Geographic Information Systems (GIS)
  - E.g., ESRI’s ArcInfo; OpenGIS Consortium
  - Geospatial information
  - All classes of spatial queries and data are common
- Computer-Aided Design/Manufacturing
  - Store spatial objects such as the surface of an airplane fuselage
  - Range queries and spatial join queries are common
- Multimedia Databases
  - Images, video, text, etc. stored and retrieved by content
  - First converted to feature vector form; high dimensionality
  - Nearest-neighbor queries are the most common

Types of Spatial Queries
- Spatial Range Queries
  - Find all cities within 50 miles of Madison
  - Query has an associated region (location, boundary)
  - Answer includes overlapping or contained data regions
- Nearest-Neighbor Queries
  - Find the 10 cities nearest to Madison
  - Results must be ordered by proximity
- Spatial Join Queries
  - Find all cities near a lake
  - Expensive; the join condition involves regions and proximity
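To make the semantics of the first two query types concrete, here is a minimal brute-force sketch. It is only a linear-scan baseline, not an index. The city names other than Madison, the coordinates, and the helper names `range_query` and `nearest_neighbors` are all made up for illustration.

```python
import math

# Hypothetical city positions on a flat plane, in miles (illustrative data only).
CITIES = {
    "A": (30.0, 30.0),
    "B": (10.0, 0.0),
    "C": (60.0, 0.0),
    "D": (0.0, 20.0),
}
MADISON = (0.0, 0.0)


def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def range_query(points, center, radius):
    """Spatial range query: names of all points within `radius` of `center`."""
    return sorted(name for name, p in points.items() if dist(p, center) <= radius)


def nearest_neighbors(points, center, k):
    """Nearest-neighbor query: the k closest names, ordered by proximity."""
    return sorted(points, key=lambda name: dist(points[name], center))[:k]


print(range_query(CITIES, MADISON, 50))      # cities within 50 miles
print(nearest_neighbors(CITIES, MADISON, 2))  # the 2 nearest cities
```

Both functions examine every point, which is exactly the cost a spatial index is meant to avoid.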

Spatial Indexing
- Point Access Methods (PAMs) vs Spatial Access Methods (SAMs)
- PAM: index only point data
  - Hierarchical (tree-based) structures
  - Multidimensional hashing
  - Space filling curve
- SAM: index both points and regions
  - Transformations
  - Overlapping regions
  - Clipping methods (non-overlapping)
- Data partitioning vs space partitioning

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.
- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space since we sort entries first by age and then by sal.
- Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>
[Figure: the entries plotted with AGE (11 to 13) on the x-axis and SAL (10 to 80) on the y-axis, with the B+ tree order drawn through them]
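The linearization can be illustrated with a short sketch that uses the four entries from the slide. Sorting tuples in Python compares the first component and then the second, which mirrors the <age, sal> composite key order. The helper name `positions_matching` is an assumption made for this example.

```python
# The four <age, sal> entries from the slide, in arbitrary input order.
entries = [(13, 75), (12, 20), (11, 80), (12, 10)]

# A composite-key B+ tree keeps entries sorted by age first, then by sal.
bplus_order = sorted(entries)


def positions_matching(order, sal_lo, sal_hi):
    """Positions in the linearized order whose sal lies in [sal_lo, sal_hi]."""
    return [i for i, (_, sal) in enumerate(order) if sal_lo <= sal <= sal_hi]


print(bplus_order)
# Entries with similar sal (75 and 80) end up at opposite ends of the order,
# so a range query on sal alone cannot use one contiguous scan.
print(positions_matching(bplus_order, 70, 80))
```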

Multidimensional Indexes
- A multidimensional index clusters entries so as to exploit “nearness” in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge!
- Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>
[Figure: the same entries plotted by AGE and SAL, contrasting the spatial clusters with the B+ tree order]

Motivation for Multidimensional Indexes
- Spatial queries (GIS, CAD)
  - Find all hotels within a radius of 5 miles from the conference venue.
  - Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  - Find all cities that lie on the Nile in Egypt.
  - Find all parts that touch the fuselage (in a plane design).
- Similarity queries (content-based retrieval)
  - Given a face, find the five most similar faces.
- Multidimensional range queries
  - 50 < age < 55 AND 80K < sal < 90K

What’s the difficulty?
- An index based on spatial location is needed.
  - One-dimensional indexes don’t support multidimensional searching efficiently. (Why?)
  - Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.
- Ideally, want to support non-point data as well (e.g., lines, shapes).

PAMs: Point Access Methods
- Hierarchical methods: kd-tree based
- Space filling curves: Z-ordering
- Multidimensional hashing: Grid File
  - Exponential growth of the directory

The problem
- Given a point set and a rectangular query, find the points enclosed in the query.
- Insertions and deletions are allowed online.
[Figure: a point set with a rectangular query region]

Tree-based PAMs
- Most tree-based PAMs are based on the kd-tree.
- The kd-tree is a main-memory binary tree for indexing k-dimensional points.
  - It needs to be adapted for the disk model.
- Levels rotate among the dimensions, partitioning the space based on a value for that dimension.
- The kd-tree is not necessarily balanced.

kd-tree
- At each level we use a different dimension.
[Figure: kd-tree example with splits x=5, y=6, and y=3, and points labeled B, C, E]

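kd-tree properties (for a balanced tree over n points in 2 dimensions; k is the number of points reported)
- Height of the tree: $O(\log_2 n)$
- Search time for exact match: $O(\log_2 n)$
- Search time for range query: $O(n^{1/2} + k)$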
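The following is a minimal main-memory sketch of a kd-tree with a rectangular range search. It assumes a static point set and builds a balanced tree by splitting at the median, which is one common choice and not necessarily the construction the slides have in mind. The names `Node`, `build`, and `range_search` are illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point                   # the point stored at this node
    axis: int                      # dimension used to split at this level
    left: Optional["Node"] = None  # points with coordinate <= split value
    right: Optional["Node"] = None # points with coordinate >= split value


def build(points, depth=0):
    """Build a balanced kd-tree; the split dimension rotates with depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid], axis,
                build(pts[:mid], depth + 1),
                build(pts[mid + 1:], depth + 1))


def range_search(node, lo, hi, out=None):
    """Collect all points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if out is None:
        out = []
    if node is None:
        return out
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    a = node.axis
    if lo[a] <= node.point[a]:      # query may reach into the left half
        range_search(node.left, lo, hi, out)
    if node.point[a] <= hi[a]:      # query may reach into the right half
        range_search(node.right, lo, hi, out)
    return out


pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(pts)
print(sorted(range_search(tree, (3, 1), (8, 5))))
```

Subtrees whose half-space does not intersect the query rectangle are skipped, which is where the savings over a full scan come from.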

kd-tree example
[Figure: a kd-tree and the corresponding partition of the plane, with x-splits at 3, 5, 7, and 8 and y-splits at 2, 5, and 6]

External memory kd-trees
- Similar to a B-tree, tree nodes split many ways instead of two ways.
  - Insertion becomes quite complex and expensive.
  - No storage utilization guarantee: when a higher-level node splits, the split has to be propagated all the way to the leaf level, resulting in many empty blocks.
- Pack many interior nodes (forming a subtree) into a block.
  - It may not be feasible to group nodes at lower levels into a block productively.
  - Many interesting papers on how to optimally pack nodes into blocks have recently been published.

Z-Curve
- What is a Z-curve?
  - A space filling curve
  - Generated by interleaving the bits of the x and y coordinates (see Fig. 4.6)
  - Alternative generation method: see Fig. 4.5
  - Connecting points by z-order (see Fig. 4.4) looks like Ns or Zs
- Implementing file operations
[Figures: Fig 4.4, Fig 4.6]
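Bit interleaving can be sketched in a few lines. Which coordinate contributes the higher bit of each pair is a convention; this sketch assumes the x bit comes first, which may differ from the figures referenced on the slide. The function name `z_value` and the `bits` parameter are illustrative.

```python
def z_value(x, y, bits=2):
    """Interleave the low `bits` bits of x and y into a single z-value.

    Assumed convention: for each bit position i, the x bit lands at position
    2*i + 1 and the y bit at position 2*i.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


# x = 2 (binary 10), y = 3 (binary 11) interleave to binary 1101.
print(z_value(2, 3))

# Visiting the cells of a 4x4 grid in z-order.
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: z_value(*c))
print(cells)
```

Because the z-value is a single integer, it can be used as the key of an ordinary one-dimensional index such as a B+ tree.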

Example of Z-values (Figure 4.7)
- The left part shows a map with spatial objects A, B, and C.
- The right part and the bottom-left part show the z-values within A, B, and C.
- Note that C gets z-values of 2 and 8, which are not close.
- Exercise: compute the z-values for B.
[Figure: Fig 4.7]

Hilbert Curve
- A space filling curve
- Example: Fig. 4.5
- More complex to generate due to rotations
  - Illustration on the next slide
- Implementing file operations
[Figure: Fig 4.5]

Calculating Hilbert Values (Optional Topic) Fig 4.8
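As a rough illustration of why rotations make Hilbert values harder to compute than z-values, here is a sketch of a widely used iterative conversion from grid coordinates to a Hilbert index. It is not taken from Fig 4.8, and the curve orientation it produces may differ from the one in the figure. The name `hilbert_value` is an assumption, and `n` must be a power of two.

```python
def hilbert_value(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its Hilbert index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        # Quadrant order along the curve: (0,0) -> (0,1) -> (1,1) -> (1,0).
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Cells of a 4x4 grid listed in Hilbert order.
order = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: hilbert_value(4, *c))
print(order)
```

Unlike the Z-curve, consecutive cells along this order are always grid neighbors, which is the property that motivates the extra work.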

Grid File
- Hashing method for multidimensional points (an extension of extensible hashing)
- Idea: use a grid to partition the space; each cell is associated with one page
- Two disk access principle (exact match)

Grid File
- Start with one bucket for the whole space.
- Select dividers along each dimension. Partition the space into cells.
- Dividers cut all the way.
- Each cell corresponds to 1 disk page.
- Many cells can point to the same page.
- The cell directory is potentially exponential in the number of dimensions.

Grid File Implementation
- Dynamic structure using a grid directory
- Grid array: a 2-dimensional array with pointers to buckets (this array can be large and disk resident): G(0, …, nx-1, 0, …, ny-1)
- Linear scales: two 1-dimensional arrays that are used to access the grid array (kept in main memory): X(0, …, nx-1), Y(0, …, ny-1)

Example
[Figure: a grid directory addressed through linear scale X and linear scale Y, with cells pointing to buckets (disk blocks)]

Grid File Search
- Exact match search: at most 2 I/Os, assuming the linear scales fit in memory
  - First use the linear scales to determine the index into the cell directory.
  - Access the cell directory to retrieve the bucket address (may cause 1 I/O if the cell directory does not fit in memory).
  - Access the appropriate bucket (1 I/O).
- Range queries
  - Use the linear scales to determine the index into the cell directory.
  - Access the cell directory to retrieve the addresses of the buckets to visit.
  - Access the buckets.
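The lookup path can be sketched with an in-memory stand-in for a grid file. Everything here is an assumption made for illustration: the class name `GridFile`, the convention that a point lying exactly on a divider belongs to the higher cell, and the use of a dictionary in place of disk pages. Splits and merges are not modeled.

```python
from bisect import bisect_right


class GridFile:
    """Toy 2-D grid file: linear scales + cell directory + buckets."""

    def __init__(self, x_scale, y_scale, directory, buckets):
        self.x_scale = x_scale      # sorted x dividers (linear scale X)
        self.y_scale = y_scale      # sorted y dividers (linear scale Y)
        self.directory = directory  # directory[i][j] -> bucket id
        self.buckets = buckets      # bucket id -> list of points (one page each)

    def cell(self, x, y):
        """Use the linear scales to find the directory index of a point."""
        return bisect_right(self.x_scale, x), bisect_right(self.y_scale, y)

    def exact_match(self, x, y):
        i, j = self.cell(x, y)             # in memory, no I/O
        bucket_id = self.directory[i][j]   # at most 1 I/O if directory is on disk
        return (x, y) in self.buckets[bucket_id]  # 1 I/O for the bucket

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        i_lo, j_lo = self.cell(x_lo, y_lo)
        i_hi, j_hi = self.cell(x_hi, y_hi)
        # Several cells may share a bucket, so collect distinct bucket ids.
        ids = {self.directory[i][j]
               for i in range(i_lo, i_hi + 1)
               for j in range(j_lo, j_hi + 1)}
        return sorted(p for b in ids for p in self.buckets[b]
                      if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)


# One divider per dimension gives a 2x2 directory; both left cells share bucket "A".
gf = GridFile(
    x_scale=[50], y_scale=[50],
    directory=[["A", "A"], ["B", "C"]],
    buckets={"A": [(10, 10), (20, 70)], "B": [(60, 20)], "C": [(80, 90)]},
)
print(gf.exact_match(20, 70))
print(gf.range_query(0, 65, 0, 30))
```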

Grid File Insertions
- Determine the bucket into which the insertion must occur.
- If there is space in the bucket, insert.
- Else, split the bucket.
  - How to choose a good dimension to split?
- If the bucket split causes the cell directory to split, do so and adjust the linear scales.
- Inserting these new directory entries potentially requires a complete reorganization of the cell directory, which is expensive.

Grid File Deletions
- Deletions may decrease the space utilization. Merge buckets.
- We need to decide which cells to merge and a merging threshold.
- Buddy system and neighbor system
  - Buddy system: a bucket can merge with only one buddy in each dimension.
  - Neighbor system: merge adjacent regions if the result is a rectangle.

Grid File Example (N=6)
[Figures: a sequence of five diagrams showing numbered points being inserted into a grid file. The space starts as a single bucket A and is split successively into buckets A, B, C, and D. The final diagram shows linear scales x1 to x4 and y1 to y4 with cells labeled A through I.]

The R-Tree
- The R-tree is a tree-structured index that remains balanced on inserts and deletes.
- Each key stored in a leaf entry is intuitively a box, or collection of intervals, with one interval per dimension.
- Example in 2-D:
[Figure: boxes in the X-Y plane, with the root of the R-tree above the leaf level]

R-Tree Properties
- Leaf entry = < n-dimensional box, rid >
  - The key value is a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = < n-dim box, ptr to child node >
  - The box covers all boxes in the child node (in fact, in the subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of an R-Tree
[Figure: spatial objects approximated by bounding boxes (leaf entries) and grouped under index entries, with boxes carrying labels such as R2, R3, R8, and R19]

Example R-Tree (Contd.)

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf, for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
Note: May have to search several subtrees at each node! (In contrast, a B-tree equality search goes to just one leaf.)
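The two search steps translate almost directly into a recursive function. This is a sketch over a hand-built two-level tree; the class name `RNode`, the box layout `(xmin, ymin, xmax, ymax)`, and the record ids are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def overlaps(a: Box, b: Box) -> bool:
    """True if two boxes share at least one point (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class RNode:
    is_leaf: bool
    entries: List[tuple]  # leaf: (box, rid); non-leaf: (box, child RNode)


def search(node: RNode, q: Box) -> list:
    """Return rids whose bounding box overlaps q (candidates, not final answers)."""
    results = []
    for box, item in node.entries:
        if not overlaps(box, q):
            continue
        if node.is_leaf:
            results.append(item)              # item is a rid
        else:
            results.extend(search(item, q))   # item is a child node
    return results


leaf1 = RNode(True, [((0, 0, 2, 2), "r1"), ((3, 3, 4, 4), "r2")])
leaf2 = RNode(True, [((5, 5, 6, 6), "r3"), ((8, 1, 9, 2), "r4")])
root = RNode(False, [((0, 0, 4, 4), leaf1), ((5, 1, 9, 6), leaf2)])

# This query overlaps both index entries, so both subtrees are searched.
print(search(root, (3.5, 3.5, 5.5, 5.5)))
```

The returned rids are only candidates, since a bounding box can overlap Q even when the object itself does not.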

Improving Search Using Constraints
- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
- But why not use convex polygons to approximate query regions more accurately?
  - This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  - The cost of the overlap test is higher than for bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
- Start at the root and go down to the “best-fit” leaf L.
  - Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the smallest-area child.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)
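The descent rule (least enlargement, ties broken by smallest area) can be sketched as a small selection function. The helper names `area`, `union`, `enlargement`, and `choose_child` are illustrative, and boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples.

```python
def area(b):
    """Area of a box (xmin, ymin, xmax, ymax)."""
    return (b[2] - b[0]) * (b[3] - b[1])


def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def enlargement(box, new_box):
    """Extra area `box` needs in order to cover `new_box`."""
    return area(union(box, new_box)) - area(box)


def choose_child(child_boxes, new_box):
    """Index of the child to descend into: least enlargement, then smallest area."""
    return min(range(len(child_boxes)),
               key=lambda i: (enlargement(child_boxes[i], new_box),
                              area(child_boxes[i])))


# First child needs 9 extra units of area, second needs 11.
print(choose_child([(0, 0, 4, 4), (5, 5, 10, 10)], (4, 4, 5, 5)))
# Both children already cover the new box; the smaller one wins the tie.
print(choose_child([(0, 0, 10, 10), (2, 2, 4, 4)], (3, 3, 4, 4)))
```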

Splitting a Node During Insertion
- The entries in node L plus the newly inserted entry must be distributed between L1 and L2.
- Goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: Redistribute so as to minimize the area of L1 plus the area of L2.
[Figure: the same set of entries split two ways, one labeled as a good split and one as a bad split]
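To show what "minimize area of L1 plus area of L2" means operationally, here is an exhaustive sketch that tries every distribution of a small node's entries. It is exponential in the number of entries and only meant for illustration; practical R-tree implementations generally rely on cheaper heuristics. The names `mbr`, `area`, and `best_split`, and the `min_fill` parameter, are assumptions.

```python
from itertools import combinations


def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])


def mbr(boxes):
    """Minimum bounding rectangle of a non-empty list of boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def best_split(boxes, min_fill):
    """Try every split with at least `min_fill` entries per side.

    Returns (total_area, group1, group2) with the smallest area(L1) + area(L2).
    """
    n = len(boxes)
    best = None
    for size in range(min_fill, n - min_fill + 1):
        for group in combinations(range(n), size):
            g1 = [boxes[i] for i in group]
            g2 = [boxes[i] for i in range(n) if i not in group]
            cost = area(mbr(g1)) + area(mbr(g2))
            if best is None or cost < best[0]:
                best = (cost, g1, g2)
    return best


# Two tight clusters: the good split keeps each cluster together.
entries = [(0, 0, 1, 1), (1, 1, 2, 2), (10, 10, 11, 11), (11, 11, 12, 12)]
cost, l1, l2 = best_split(entries, min_fill=2)
print(cost, l1, l2)
```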

Spatial Data Warehousing
- Spatial data warehouse: integrated, subject-oriented, time-variant, and nonvolatile spatial data repository for data analysis and decision making
- Spatial data integration: a big issue
  - Structure-specific formats (raster- vs. vector-based, OO vs. relational models, different storage and indexing, etc.)
  - Vendor-specific formats (ESRI, MapInfo, Intergraph, etc.)
- Spatial data cube: multidimensional spatial database
  - Both dimensions and measures may contain spatial components

Dimensions and Measures in Spatial Data Warehouse
- Dimension modeling
  - nonspatial: e.g., temperature 25-30 degrees generalizes to hot
  - spatial-to-nonspatial: e.g., region “B.C.” generalizes to description “western provinces”
  - spatial-to-spatial: e.g., region “Burnaby” generalizes to region “Lower Mainland”
- Measures
  - numerical
    - distributive (e.g., count, sum)
    - algebraic (e.g., average)
    - holistic (e.g., median, rank)
  - spatial: collection of spatial pointers (e.g., pointers to all regions with 25-30 degrees in July)
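The three kinds of numerical measure differ in how they can be combined from partial results. The sketch below uses two made-up partitions of temperature readings; the function names are illustrative.

```python
from statistics import median

# Hypothetical temperature readings stored in two separate partitions.
part_a = [25, 27, 30]
part_b = [10, 12]


def combined_count_sum(parts):
    """Distributive: combine per-partition counts and sums directly."""
    return sum(len(p) for p in parts), sum(sum(p) for p in parts)


def combined_average(parts):
    """Algebraic: derived from a fixed number of distributive values."""
    count, total = combined_count_sum(parts)
    return total / count


def median_of_medians(parts):
    """Holistic measures do NOT combine this way; shown only as a counterexample."""
    return median(median(p) for p in parts)


print(combined_count_sum([part_a, part_b]))
print(combined_average([part_a, part_b]))
# The true median needs all the raw values, not per-partition summaries.
print(median(part_a + part_b), median_of_medians([part_a, part_b]))
```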

Example: BC weather pattern analysis
- Input
  - A map with about 3,000 weather probes scattered in B.C.
  - Daily data for temperature, precipitation, wind velocity, etc.
  - Concept hierarchies for all attributes
- Output
  - A map that reveals patterns: merged (similar) regions
- Goals
  - Interactive analysis (drill-down, slice, dice, pivot, roll-up)
  - Fast response time
  - Minimizing storage space used
- Challenge
  - A merged region may contain hundreds of “primitive” regions (polygons)

Star Schema of the BC Weather Warehouse
- Spatial data warehouse
  - Dimensions: region_name, time, temperature, precipitation
  - Measurements: region_map, area, count
[Figure: star schema with dimension tables around a fact table]

Spatial Merge
- Precomputing all: too much storage space
- On-line merge: very expensive

Methods for Computation of Spatial Data Cube
- On-line aggregation: collect and store pointers to spatial objects in a spatial data cube
  - Expensive and slow; needs efficient aggregation techniques
- Precompute and store all the possible combinations
  - Huge space overhead
- Precompute and store rough approximations in a spatial data cube
  - Accuracy trade-off
- Selective computation: only materialize those which will be accessed frequently
  - A reasonable choice

Spatial Association Analysis
- Spatial association rule: A → B [s%, c%]
  - A and B are sets of spatial or nonspatial predicates
    - Topological relations: intersects, overlaps, disjoint, etc.
    - Spatial orientations: left_of, west_of, under, etc.
    - Distance information: close_to, within_distance, etc.
  - s% is the support and c% is the confidence of the rule
- Examples
  - is_a(x, large_town) ^ intersect(x, highway) → adjacent_to(x, water) [7%, 85%]
  - is_a(x, large_town) ^ adjacent_to(x, georgia_strait) → close_to(x, u.s.a.) [1%, 78%]
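Support and confidence can be computed by simple counting once the predicates have been evaluated for each object. The sketch below assumes the usual definitions (support is the fraction of all objects satisfying both A and B; confidence is the fraction of objects satisfying A that also satisfy B). The ten objects and their precomputed predicate flags are invented for illustration.

```python
def rule_stats(objects, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n = len(objects)
    n_a = sum(1 for o in objects if antecedent(o))
    n_ab = sum(1 for o in objects if antecedent(o) and consequent(o))
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


# Hypothetical objects with precomputed predicate values.
objects = (
    [{"large_town": True, "on_highway": True, "near_water": True}] * 3
    + [{"large_town": True, "on_highway": True, "near_water": False}] * 1
    + [{"large_town": False, "on_highway": True, "near_water": True}] * 2
    + [{"large_town": False, "on_highway": False, "near_water": False}] * 4
)


def a(o):
    # is_a(x, large_town) ^ intersect(x, highway)
    return o["large_town"] and o["on_highway"]


def b(o):
    # adjacent_to(x, water)
    return o["near_water"]


support, confidence = rule_stats(objects, a, b)
print(f"[{support:.0%}, {confidence:.0%}]")
```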

Progressive Refinement Mining of Spatial Association Rules
- Hierarchy of spatial relationships
  - g_close_to: near_by, touch, intersect, contain, etc.
  - First search for a rough relationship and then refine it
- Two-step mining of spatial association
  - Step 1: Rough spatial computation (as a filter)
    - Using MBR or R-tree for rough estimation
  - Step 2: Detailed spatial algorithm (as refinement)
    - Apply only to those objects which have passed the rough spatial association test (no less than min_support)
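The filter-and-refine idea can be sketched for a single close_to test. The sketch assumes each object is represented by a small set of sample points, so that the "detailed" test is the minimum pairwise point distance; the MBR distance is a lower bound on it, which makes the filter safe. Object names, coordinates, and function names are all invented for the example.

```python
import math
from itertools import product


def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a point set."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_distance(a, b):
    """Distance between two rectangles (0 if they overlap or touch)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return math.hypot(dx, dy)


def exact_distance(p, q):
    """Detailed test: minimum distance over all pairs of sample points."""
    return min(math.hypot(a[0] - b[0], a[1] - b[1]) for a, b in product(p, q))


def close_pairs(objs, d):
    """Return (candidates after the MBR filter, pairs confirmed by refinement)."""
    names = sorted(objs)
    boxes = {n: mbr(objs[n]) for n in names}
    candidates = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
                  if mbr_distance(boxes[a], boxes[b]) <= d]
    confirmed = [(a, b) for a, b in candidates
                 if exact_distance(objs[a], objs[b]) <= d]
    return candidates, confirmed


objs = {
    "lake": [(0, 0), (10, 0), (0, 10)],  # L-shaped, so its MBR is mostly empty
    "town": [(9, 9)],                    # inside the lake's MBR, far from the lake
    "road": [(11, 0), (12, 0)],
}
candidates, confirmed = close_pairs(objs, d=2)
print(candidates)
print(confirmed)
```

The pair (lake, town) passes the rough test but is rejected by the detailed one, while (road, town) is never examined in detail at all.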

Spatial Classification and Spatial Trend Analysis
- Spatial classification
  - Analyze spatial objects to derive classification schemes, such as decision trees, in relevance to certain spatial properties (district, highway, river, etc.)
  - Example: classify regions in a province into rich vs. poor according to the average family income
- Spatial trend analysis
  - Detect changes and trends along a spatial dimension
  - Study the trend of nonspatial or spatial data changing with space
  - Example: observe the trend of changes of the climate or vegetation with increasing distance from an ocean

LSD-tree (Local Split Decision tree)
- Use a kd-tree to partition the space. Each partition contains up to B points.
- The kd-tree is stored in main memory.
- If the kd-tree (directory) is large, we store a sub-tree on disk.
- Goal: the structure must remain balanced (external balancing property)

Example: LSD-tree

LSD-tree: main points
- Split strategies
  - Data dependent
  - Distribution dependent
- Paging algorithm
- Two types of splits: bucket splits and internal node splits

Handling Regions with Z-curve Fig 4.9