 Spatial data requires special data structures, similar to B-trees.

Slides:



Advertisements
Similar presentations
1 DATA STRUCTURES USED IN SPATIAL DATA MINING. 2 What is Spatial data ? broadly be defined as data which covers multidimensional points, lines, rectangles,
Advertisements

Spatial Indexing SAMs. Spatial Indexing Point Access Methods can index only points. What about regions? Z-ordering and quadtrees Use the transformation.
Access Methods for Advanced Database Applications.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 9.
Query Processing in Databases Dr. M. Gavrilova.  Introduction  I/O algorithms for large databases  Complex geometric operations in graphical querying.
2-dimensional indexing structure
Spatial Indexing SAMs. Spatial Indexing Point Access Methods can index only points. What about regions? Z-ordering and quadtrees Use the transformation.
Multiple-key indexes Index on one attribute provides pointer to an index on the other. If V is a value of the first attribute, then the index we reach.
Spatial Access Methods Chapter 26 of book Read only 26.1, 26.2, 26.6 Dr Eamonn Keogh Computer Science & Engineering Department University of California.
Spatial Indexing SAMs. Spatial Access Methods PAMs Grid File kd-tree based (LSD-, hB- trees) Z-ordering + B+-tree R-tree Variations: R*-tree, Hilbert.
Project Proposals Simonas Šaltenis Aalborg University Nykredit Center for Database Research Department of Computer Science, Aalborg University.
Tree-Structured Indexes. Introduction v As for any index, 3 alternatives for data entries k* : À Data record with key value k Á Â v Choice is orthogonal.
Spatial Indexing SAMs.
I/O-Algorithms Lars Arge University of Aarhus March 1, 2005.
Multi-dimensional Indexes
Retrieving Data from Disk – Access Methods Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 19, 2005 Slides.
Spatial Information Systems (SIS) COMP Spatial access methods: Indexing.
1 R-Trees for Spatial Indexing Yanlei Diao UMass Amherst Feb 27, 2007 Some Slide Content Courtesy of J.M. Hellerstein.
Chapter 3: Data Storage and Access Methods
Spatio-Temporal Databases. Introduction Spatiotemporal Databases: manage spatial data whose geometry changes over time Geometry: position and/or extent.
R-Trees 2-dimensional indexing structure. R-trees 2-dimensional version of the B-tree: B-tree of maximum degree 8; degree between 3 and 8 Internal nodes.
Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.
Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.
Spatio-Temporal Databases. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases …..
What we have covered? Indexing and Hashing Data warehouse and OLAP
R-TREES: A Dynamic Index Structure for Spatial Searching by A. Guttman, SIGMOD Shahram Ghandeharizadeh Computer Science Department University of.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 9.
Tree-Structured Indexes. Range Searches ``Find all students with gpa > 3.0’’ –If data is in sorted file, do binary search to find first such student,
Storage and Indexing February 26 th, 2003 Lecture 19.
 B+ Tree Definition  B+ Tree Properties  B+ Tree Searching  B+ Tree Insertion  B+ Tree Deletion.
Database Management 8. course. Query types Equality query – Each field has to be equal to a constant Range query – Not all the fields have to be equal.
Trees for spatial data representation and searching
Spatial Data Management Chapter 28. Types of Spatial Data Point Data –Points in a multidimensional space E.g., Raster data such as satellite imagery,
Database management Systems, 3ed, R. Ramakrishnan and J. Gehrke1 Spatial Data Management Chapter 28.
Database System Concepts ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com 1 Indexing Spatial Data.
Multidimensional Indexes Applications: geographical databases, data cubes. Types of queries: –partial match (give only a subset of the dimensions) –range.
R-Tree. 2 Spatial Database (Ia) Consider: Given a city map, ‘index’ all university buildings in an efficient structure for quick topological search.
Antonin Guttman In Proceedings of the 1984 ACM SIGMOD international conference on Management of data (SIGMOD '84). ACM, New York, NY, USA.
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
Spatial and Geographic Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria University (Karachi Campus)
Bin Yao (Slides made available by Feifei Li) R-tree: Indexing Structure for Data in Multi- dimensional Space.
Spatial Database 2/5/2011 Reference – Ramakrishna Gerhke and Silbershatz.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Spatial and Geographic Databases. Spatial databases store information related to spatial locations, and support efficient storage, indexing and querying.
Spatial Indexing Techniques Introduction to Spatial Computing CSE 5ISC Some slides adapted from Spatial Databases: A Tour by Shashi Shekhar Prentice Hall.
R-Trees: A Dynamic Index Structure For Spatial Searching Antonin Guttman.
Indexing Database Management Systems. Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files File Organization 2.
1 CSIS 7101: CSIS 7101: Spatial Data (Part 1) The R*-tree : An Efficient and Robust Access Method for Points and Rectangles Rollo Chan Chu Chung Man Mak.
Storage and Indexing. How do we store efficiently large amounts of data? The appropriate storage depends on what kind of accesses we expect to have to.
1 Tree-Structured Indexes Chapter Introduction  As for any index, 3 alternatives for data entries k* :  Data record with key value k   Choice.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Content based on Chapter 10 Database Management Systems, (3 rd.
Spatio-Temporal Databases. Term Project Groups of 2 students You can take a look on some project ideas from here:
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Tree-Structured Indexes Chapter 10.
Jeremy Iverson & Zhang Yun 1.  Chapter 6 Key Concepts ◦ Structures and access methods ◦ R-Tree  R*-Tree  Mobile Object Indexing  Questions 2.
Tree-Structured Indexes. Introduction As for any index, 3 alternatives for data entries k*: – Data record with key value k –  Choice is orthogonal to.
1 R-Trees Guttman. 2 Introduction Range queries in multiple dimensions: Computer Aided Design (CAD) Geo-data applications Support special data objects.
Spatial Data Management
Spatio-Temporal Databases
Azita Keshmiri CS 157B Ch 12 indexing and hashing
COP Introduction to Database Structures
Chapter 25: Advanced Data Types and New Applications
Query Processing in Databases Dr. M. Gavrilova
Spatio-Temporal Databases
B+-Trees and Static Hashing
B+Trees The slides for this text are organized into chapters. This lecture covers Chapter 9. Chapter 1: Introduction to Database Systems Chapter 2: The.
Spatial Indexing I R-trees
Storage and Indexing.
General External Merge Sort
Spatial Data Management
Tree-Structured Indexes
Presentation transcript:

 Spatial data requires special data structures, similar to B-trees

 Types of spatial data  Example of spatial and geometric data – splines and Voronoi diagrams  One-dimensional index – interval trees  R-trees  T-tree variants

 Point Data ◦ Points in a multidimensional space ◦ E.g., Raster data such as satellite imagery, where each pixel stores a measured value ◦ E.g., Feature vectors extracted from text  Region Data ◦ Objects have spatial extent with location and boundary ◦ DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

 Different types of sampling are used to collect data

 Low-pass filter – the value for the cell is computed as average of other cells  High-pass-continuous surface –low pass

 Window size has effect on filtering

 Spatial Range Queries ◦ Find all cities within 50 miles of Madison ◦ Query has associated region (location, boundary) ◦ Answer includes ovelapping or contained data regions  Nearest-Neighbor Queries ◦ Find the 10 cities nearest to Madison ◦ Results must be ordered by proximity  Spatial Join Queries ◦ Find all cities near a lake ◦ Expensive, join condition involves regions and proximity

 Geographic Information Systems (GIS) ◦ E.g., ESRI’s ArcInfo; OpenGIS Consortium ◦ Geospatial information ◦ All classes of spatial queries and data are common  Computer-Aided Design/Manufacturing ◦ Store spatial objects such as surface of airplane fuselage ◦ Range queries and spatial join queries are common  Multimedia Databases ◦ Images, video, text, etc. stored and retrieved by content ◦ First converted to feature vector form; high dimensionality ◦ Nearest-neighbor queries are the most common

 B+ trees are fundamentally single-dimensional indexes.  When we create a composite search key B+ tree, e.g., an index on, we effectively linearize the 2- dimensional space since we sort entries first by age and then by sal. Consider entries:, B+ tree order

 A multidimensional index clusters entries so as to exploit “nearness” in multidimensional space.  Keeping track of entries and maintaining a balanced index structure presents a challenge! Consider entries:, Spatial clusters B+ tree order

 Spatial queries (GIS, CAD). ◦ Find all hotels within a radius of 5 miles from the conference venue. ◦ Find the city with population 500,000 or more that is nearest to Kalamazoo, MI. ◦ Find all cities that lie on the Nile in Egypt. ◦ Find all parts that touch the fuselage (in a plane design).  Multidimensional range queries. ◦ 50 < age < 55 AND 80K < sal < 90K

 Similarity queries (content-based retrieval). ◦ Given a face, find the five most similar faces/expressions.

 Geometric, 1-dimensional tree  Interval is defined by (x 1,x 2 )  Split at the middle (5), again at the middle (3,7), again at the middle (2,8)  All intervals intersecting a middle point are stored at the corresponding root (4,6) (4,8) (6,9) (7.5,8.5) (2,4)

 Finding intervals – by finding x 1, x 2 against the nodes  Find interval containing specific value – from the root  Sort intervals within each node of the tree according to their coorsinates  Cost of the “stabbing query”– finding all intervals containing the specified value is O(log n + k), where k is the number of reported intervals.

 Constructs the minimal bounding box (mbb)  Check validity (predicate) on mbb  Refinement step verifies if actual objects satisfy the predicate.

 An index based on spatial location needed. ◦ One-dimensional indexes don’t support multidimensional searching efficiently. ◦ Hash indexes only support point queries; want to support range queries as well. ◦ Must support inserts and deletes gracefully.  Ideally, want to support non-point data as well (e.g., lines, shapes).  The R-tree meets these requirements, and variants are widely used today.

 The R-tree is a tree-structured index that remains balanced on inserts and deletes.  Each key stored in a leaf entry is intuitively a box, or collection of intervals, with one interval per dimension.  Example in 2-D: X Y Root of R Tree Leaf level

 Leaf entry = ◦ This is Alternative (2), with key value being a box. ◦ Box is the tightest bounding box for a data object.  Non-leaf entry = ◦ Box covers all boxes in child node (in fact, subtree).  All leaves at same distance from root.  Nodes can be kept 50% full (except root). ◦ Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

R8 R9 R10 R11 R12 R17 R18 R19 R13 R14 R15 R16 R1 R2 R3 R4 R5 R6 R7 Leaf entry Index entry Spatial object approximated by bounding box R8

R1R2 R3R4R5R6R7 R8R9R10R11R12R13R14R15 R16 R17R18R19

Start at root. 1. If current node is non-leaf, for each entry, if box E overlaps Q, search subtree identified by ptr. 2. If current node is leaf, for each entry, if E overlaps Q, rid identifies an object that might overlap Q. Note: May have to search several subtrees at each node! (In contrast, a B-tree equality search goes to just one leaf.)

 It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.  But why not use convex polygons to approximate query regions more accurately? ◦ Will reduce overlap with nodes in tree, and reduce the number of nodes fetched by avoiding some branches altogether. ◦ Cost of overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

 Start at root and go down to “best-fit” leaf L. ◦ Go to child whose box needs least enlargement to cover B; resolve ties by going to smallest area child.  If best-fit leaf L has space, insert entry and stop. Otherwise, split L into L1 and L2. ◦ Adjust entry for L in its parent so that the box now covers (only) L1. ◦ Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

 The entries in node L plus the newly inserted entry must be distributed between L1 and L2.  Goal is to reduce likelihood of both L1 and L2 being searched on subsequent queries.  Idea: Redistribute so as to minimize area of L1 plus area of L2. Exhaustive algorithm is too slow; quadratic and linear heuristics are popular in research. GOOD SPLIT! BAD!

 The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting: ◦ Remove some (say, 30% of the) entries and reinsert them into the tree. ◦ Could result in all reinserted entries fitting on some existing pages, avoiding a split.  R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.  Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary. ◦ Searches now take a single path to a leaf, at cost of redundancy.

 The Generalized Search Tree (GiST) abstracts the “tree” nature of a class of indexes including B+ trees and R-tree variants. ◦ Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide “templates” for these algorithms that can be customized to obtain the many different tree index structures. ◦ B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs. ◦ GiST provides an alternative for implementing other tree indexes.

 Typically, high-dimensional datasets are collections of points, not regions. ◦ E.g., Feature vectors in multimedia applications. ◦ Very sparse  Nearest neighbor queries are common. ◦ R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.  As dimensionality increases contrast (ratio of distances between nearest and farthest points) usually decreases; “nearest neighbor” is not meaningful.

 Spatial data management has many applications, including GIS, CAD/CAM, multimedia indexing. ◦ Point and region data ◦ Overlap/containment and nearest-neighbor queries  Many approaches to indexing spatial data ◦ R-tree approach is widely used in GIS systems ◦ Other approaches include Grid Files, Quad trees, and techniques based on “space-filling” curves. ◦ For high-dimensional datasets, unless data has good “contrast”, nearest-neighbor may not be well-separated

 Deletion consists of searching for the entry to be deleted, removing it, and if the node becomes under-full, deleting the node and then re-inserting the remaining entries.  Overall, works quite well for 2 and 3 D datasets. Several variants (notably, R+ and R* trees) have been proposed; widely used.  Can improve search performance by using a convex polygon to approximate query shape (instead of a bounding box) and testing for polygon-box intersection.

 "Print Gallery," by M.C. Escher. Curious about the blank spot in the middle of Escher’s 1956 lithograph, Hendrik Lenstra set out to learn whether the artist had encountered a mathematical problem he couldn’t solve. ©2002 Cordon Art B.V., Baarn, Holland. All rights reserved.

 Fixed grid:  Stored as a 2D array, each entry contains a link to a list of points (object) stored in a grid. a,b

 Too many points in one grid cell: Solution A –overflow (linked list) Solution B- Split the cell and increase index!

 Rectangles may share different grid cells  Rectangle duplicates are stored  Grid cells are of fixed size

 In a grid file, the index is dynamically increased in size when overflow happens.  The space is split by a vertical or a horizontal line, and then further subdivided when overflow happens!  Index is dynamically growing  Boundaries of cells of different sizes are stores, thus point and stabbing queries are easy

 Instead of using an array as an index, use tree!  Quadtree decomposition – cells are indexed by using quaternary B-tree.  All cells are squares, not polygons.  Search in a tree is faster!

First three levels of a quad tree

 Project #32: PICTURE REPRESENTATION USING QUAD TREES, McGill University: McGill University: 8 x 8 pixel picture represented in a quad tree

 Example of a grid file

 B+ index – actual references to rectangles are stored in the leaves, saving more space+ access time  Label nodes according to Z or “pi” order

 Level of detail increases as the number of quadtree decompositions increases! Decompositions have indexes of a form: 00,01,02,03,10,11,12,13, 2, ,302,303,31,32,33 ◦ Stores as Bplus tree

 What is spatial data structure?  What is the difference between grid and grid file?  Explain how z or p ordering works?  Define interval trees  Provide example of R-tree  List R-tree variants  How spatial index structure differs from regular B+ tree?

 Text 1 instructor’s resources  McGill University web space  Wikepedia (z order images)  Face recognition research  SPARCS lab project on image processing