
Guofeng Cao
CyberInfrastructure and Geospatial Information Laboratory, Department of Geography
National Center for Supercomputing Applications (NCSA)
University of Illinois at Urbana-Champaign
Geog 480: Principles of GIS - Data Structures and Indexing

Physical Data Storage - Disk

Physical Data Management:
Databases typically organize data into files, each containing a collection of records.
The atomic unit of data on a disk is a block.
The time taken to read or write a block has three components:
  o Seek time: the time taken for mechanical movement of the read heads
  o Latency: the time taken to rotate the disks into the correct position
  o Transfer time: the time taken to transfer the block to/from the CPU
Performance:
Database structures that lessen seek time and latency improve performance.

File Organization
Field: a named place for a data item in a record (cf. attribute)
Record: a sequence of fields related to a single logical entity (cf. tuple); records are held by disk blocks
File: a sequence of records, usually of the same type (cf. relation)
Database: a collection of related files

Ordered and Unordered Files
In unordered files, new records are inserted in the next physical location on the disk:
  o Insertion is very efficient
  o Retrieval requires searching through every record in sequence: linear search with time complexity O(n)
  o Deletion causes “holes” to appear in the sequence
In ordered files, each record is inserted in the order of the values of one or more of its fields:
  o Slows the insertion of new records
  o Allows efficient binary search with time complexity O(log₂ n) on the ordering field, but not on other fields

Binary Search Algorithm
Input: An ordered file with an ordering field, placed on n disk blocks (labeled 1 to n), and a search value V
  low ← 1; high ← n
  while high ≥ low do
    mid ← (low + high) div 2
    read block mid into memory
    if V < value of ordering field in first record of block mid then
      high ← mid - 1
    else if V > value of ordering field in last record of block mid then
      low ← mid + 1
    else
      linear search block mid for records with value V in their ordering field, possibly proceeding to next block(s), then halt
Output: Records from the file with value V in their ordering field
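
A minimal Python sketch of the block-level binary search above. It assumes the ordered file is modeled as an in-memory list of non-empty blocks, each a list of (key, payload) records; the name `block_binary_search` and the record layout are illustrative, and block numbers are 0-based rather than 1-based. The step that backs up over earlier blocks ending in V is an addition, so that duplicates straddling a block boundary are not missed.

```python
def block_binary_search(blocks, v):
    """Return every record whose key equals v.

    blocks: list of non-empty lists of (key, payload) records, sorted by key
    within and across blocks (a stand-in for an ordered file on disk).
    """
    low, high = 0, len(blocks) - 1           # 0-based block numbers
    while high >= low:
        mid = (low + high) // 2
        block = blocks[mid]                  # stands in for "read block mid into memory"
        if v < block[0][0]:
            high = mid - 1
        elif v > block[-1][0]:
            low = mid + 1
        else:
            # v lies within this block's key range. Back up over earlier
            # blocks that end with v (assumption: duplicates may straddle blocks).
            start = mid
            while start > 0 and blocks[start - 1][-1][0] == v:
                start -= 1
            hits = []
            for b in blocks[start:]:         # proceed to next block(s) while they can hold v
                if b[0][0] > v:
                    break
                hits.extend(rec for rec in b if rec[0] == v)
            return hits
    return []


FILE_BLOCKS = [
    [(1, "a"), (3, "b")],
    [(5, "c"), (7, "d")],
    [(7, "e"), (9, "f")],
    [(11, "g"), (13, "h")],
]
print(block_binary_search(FILE_BLOCKS, 7))   # [(7, 'd'), (7, 'e')]
```

Only the blocks visited by the loop would be read from disk, which is where the O(log₂ n) block accesses come from.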

Binary Search Algorithm - Example

Indexes
Physical file organization alone cannot solve all storage and retrieval problems.
An index is an auxiliary structure specifically designed to speed retrieval of records.
Indexes trade space for speed.
A single-level index is an ordered file with two fields:
  o An index field containing the ordered values of the indexing field in the data file
  o A pointer field containing the addresses of the disk blocks that hold a particular index value
Retrieving a record based on an indexed search condition requires a binary search of the (ordered) index file.
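
The single-level index described above could be sketched in Python as follows. The data file, the student names, and the functions `build_index` and `lookup` are hypothetical; the index is held as two parallel sorted lists (index field, block address) and searched with the standard `bisect` module.

```python
from bisect import bisect_left

# Hypothetical data file: block address -> unordered (last_name, first_name) records
DATA_BLOCKS = {
    0: [("Moss", "Ana"), ("Adams", "Bo")],
    1: [("Zhou", "Li"), ("Cole", "Di")],
    2: [("Adams", "Cy"), ("Reyes", "Ed")],
}


def build_index(data_blocks):
    """Return parallel lists: ordered index-field values and block addresses."""
    entries = sorted({(rec[0], addr)
                      for addr, recs in data_blocks.items() for rec in recs})
    return [k for k, _ in entries], [a for _, a in entries]


def lookup(index, data_blocks, key):
    keys, addrs = index
    i = bisect_left(keys, key)               # binary search of the ordered index
    hits = []
    while i < len(keys) and keys[i] == key:  # one index entry per (value, block)
        hits.extend(rec for rec in data_blocks[addrs[i]] if rec[0] == key)
        i += 1
    return hits


INDEX = build_index(DATA_BLOCKS)
print(lookup(INDEX, DATA_BLOCKS, "Adams"))   # [('Adams', 'Bo'), ('Adams', 'Cy')]
```

The data blocks stay unordered, so insertion into the file remains cheap; the extra space goes into the index lists.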

Student File Indexed by Last Name

B-Trees
Maintaining an index structure can be difficult.
A B-tree indexes linearly ordered data that may change frequently.
B-trees remain balanced: all branches of the tree stay the same length as the tree is modified.
Each node in a B-tree contains pointers to indexed records.
Additionally, internal nodes contain pointers to their immediate descendants.
The values in a descendant node lie within the range set by the parent node.

Searching & Modifying a B-Tree
Search: begin at the root and continue until an exact match or a leaf is encountered.
Insert: search to find the position for the new index record.
  o If there is space, no restructuring is required
  o If a non-root node overflows, split the node and promote the middle value
  o If the root node overflows, split the node and demote the extreme values
Delete: similar to insert.
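
A compact Python sketch of B-tree search and insertion with splitting, under several simplifying assumptions: keys are distinct integers, nodes store keys only (no record pointers), deletion is omitted, and `max_keys` is a hypothetical capacity parameter. On root overflow it creates a new root holding the middle value, which yields the same tree as keeping the middle value in the root and demoting the extremes.

```python
from bisect import bisect_left


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []        # empty list for a leaf

    @property
    def leaf(self):
        return not self.children


class BTree:
    def __init__(self, max_keys=2):
        self.root = Node()
        self.max_keys = max_keys

    def search(self, key):
        node = self.root
        while True:
            i = bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                return True                   # exact match
            if node.leaf:
                return False                  # reached a leaf without a match
            node = node.children[i]

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:                             # root overflowed: tree grows one level
            mid, right = split
            self.root = Node([mid], [self.root, right])

    def _insert(self, node, key):
        i = bisect_left(node.keys, key)
        if node.leaf:
            node.keys.insert(i, key)
        else:
            split = self._insert(node.children[i], key)
            if split:                         # child split: absorb promoted value
                mid, right = split
                node.keys.insert(i, mid)
                node.children.insert(i + 1, right)
        return self._split(node) if len(node.keys) > self.max_keys else None

    def _split(self, node):
        m = len(node.keys) // 2
        mid = node.keys[m]                    # middle value is promoted
        right = Node(node.keys[m + 1:], node.children[m + 1:])
        node.keys, node.children = node.keys[:m], node.children[:m + 1]
        return mid, right


TREE = BTree(max_keys=2)
for k in range(1, 11):
    TREE.insert(k)
print(TREE.root.keys, TREE.search(7), TREE.search(11))
```

Because splits propagate upward and only the root split adds a level, every leaf stays at the same depth.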

B-Tree
B+-trees, where pointers to records are stored only at leaf nodes, are more often used in practice.
B-Tree Properties:
  o A B-tree is completely balanced (every path from root to leaf has the same length) at all stages in its evolution
  o Search time is bounded by the length of the path, and so is O(log n)
  o Insertion and deletion of records require O(log n) time
  o Each node is guaranteed to be at least half full (or almost half full with odd fan-out ratios) at all stages in a B-tree’s evolution

Spatial Indexes
Previous examples have concerned multi-dimensional data where dimensions are essentially independent.
Although spatial dimensions are orthogonal, there is dependency between them in terms of the Euclidean metric.

Id  Site                      East  North
 1  Newcastle Museum            14     58
 2  Waterworld                  31     65
 3  Gladstone Pottery Museum    74     23
 4  Trentham Gardens            20     00
 5  New Victoria Theater        18     55
 6  Beswick Pottery             66     25
 7  Coalport Pottery            54     36
 8  Spode Pottery               37     43
 9  Minton Pottery              36     39
10  Royal Doulton Pottery       31     87
11  City Museum                 41     62
12  Westport Lake               17     92
13  Ford Green Hall             53     99
14  Park Hall Country Park      86     44

Potteries Example

Spatial Queries
Point query: retrieve all records with spatial references located at a particular point.
Range query: retrieve all records with spatial references located within a given range (spatial ranges may be any shape, but are often rectangular).
Example:
  o Non-spatial query: retrieve the point location of Trentham Gardens
  o Spatial point query: retrieve any site at location (37, 43)
  o Spatial range query: retrieve any site in the rectangle defined by (20, 20)–(40, 50)

Potteries Indexes

East  Site
  14  Newcastle Museum
  17  Westport Lake
  18  New Victoria Theater
  20  Trentham Gardens
  31  Waterworld
  31  Royal Doulton Pottery
  36  Minton Pottery
  37  Spode Pottery
  41  City Museum
  53  Ford Green Hall
  54  Coalport Pottery
  66  Beswick Pottery
  74  Gladstone Pottery Museum
  86  Park Hall Country Park

North  Site
   00  Trentham Gardens
   23  Gladstone Pottery Museum
   25  Beswick Pottery
   36  Coalport Pottery
   39  Minton Pottery
   43  Spode Pottery
   44  Park Hall Country Park
   55  New Victoria Theater
   58  Newcastle Museum
   62  City Museum
   65  Waterworld
   87  Royal Doulton Pottery
   92  Westport Lake
   99  Ford Green Hall
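
To illustrate why two independent one-dimensional indexes are a poor fit for spatial queries, here is a hedged Python sketch that answers the example range query (20, 20)–(40, 50) using only the East index. Coordinates are taken from the index listings; the function `range_query` and the in-memory layout are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right

# (site, east, north), values as listed in the Potteries indexes
SITES = [
    ("Newcastle Museum", 14, 58), ("Waterworld", 31, 65),
    ("Gladstone Pottery Museum", 74, 23), ("Trentham Gardens", 20, 0),
    ("New Victoria Theater", 18, 55), ("Beswick Pottery", 66, 25),
    ("Coalport Pottery", 54, 36), ("Spode Pottery", 37, 43),
    ("Minton Pottery", 36, 39), ("Royal Doulton Pottery", 31, 87),
    ("City Museum", 41, 62), ("Westport Lake", 17, 92),
    ("Ford Green Hall", 53, 99), ("Park Hall Country Park", 86, 44),
]

EAST_INDEX = sorted((east, site) for site, east, _ in SITES)
EAST_KEYS = [east for east, _ in EAST_INDEX]
NORTH_OF = {site: north for site, _, north in SITES}


def range_query(e_min, n_min, e_max, n_max):
    """Return (candidates from the East index, candidates that also match on North)."""
    lo = bisect_left(EAST_KEYS, e_min)
    hi = bisect_right(EAST_KEYS, e_max)
    candidates = [site for _, site in EAST_INDEX[lo:hi]]
    hits = [s for s in candidates if n_min <= NORTH_OF[s] <= n_max]
    return candidates, hits


candidates, hits = range_query(20, 20, 40, 50)
print(len(candidates), hits)   # 5 ['Minton Pottery', 'Spode Pottery']
```

The East index narrows the search to five candidates, but three of them must still be fetched and rejected on their North value; the one-dimensional index cannot exploit proximity in both dimensions at once.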

Potteries Indexes

Two-Dimensional Ordering
Many common indexes assume a grid-based representation (tile indexes).
Tile indexes aim to provide a path through the grid that visits each cell.
Indexes differ in how well they preserve proximity, i.e., whether cells that are spatially close are also close in the index.
From one to two dimensions: the main problem facing multidimensional spatial data structures is that data storage is essentially one-dimensional.

Common Tile Indexes
  o Row
  o Peano-Hilbert
  o Morton
  o Spiral
  o Cantor-Diagonal
  o Row-Prime
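
As an illustration of two of the tile indexes named above, the following Python sketch computes Row and Morton keys for grid cells. The function names and the choice of putting the column bit in the lower position of each bit pair are assumptions; other conventions give a mirrored curve.

```python
def row_key(row, col, width):
    """Row ordering: cells are numbered left to right, one row after another."""
    return row * width + col


def morton_key(row, col, bits=16):
    """Morton ordering: interleave the bits of col (even positions) and row (odd)."""
    key = 0
    for i in range(bits):
        key |= ((col >> i) & 1) << (2 * i)
        key |= ((row >> i) & 1) << (2 * i + 1)
    return key


cells = [(r, c) for r in range(4) for c in range(4)]
morton_path = sorted(cells, key=lambda rc: morton_key(*rc))
print(morton_path[:4])   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

In row order on a grid of width w, vertically adjacent cells are w positions apart in the index, whereas the Morton path keeps each 2x2 block of cells contiguous.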

Introduction to Raster Structures
Rasters provide a fixed grid for storing data.
Cells are addressed using the row and column number.
Rasters may be used to represent a range of computable spatial objects, including:
  o A point, represented by a single cell
  o A strand or polyline, represented by a sequence of neighboring cells
  o A connected area, represented by a contiguous collection of cells
Rasters may be stored as arrays, which are natural computable structures but can be wasteful in terms of space.

Freeman Chain Coding
Freeman chain coding uses the numbers 0 to 7 arranged clockwise around the 8 directions:
N = 0, NE = 1, E = 2, SE = 3, S = 4, SW = 5, W = 6, NW = 7
Example:
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 6, 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 6, 6, 0, 0, 0, 0, 0, 0]
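
A short Python sketch that decodes the example chain into cell coordinates, assuming east is +x and north is +y and an arbitrary starting cell of (0, 0); the names `STEPS` and `decode_chain` are illustrative.

```python
# Direction code -> (d_east, d_north), clockwise from north as defined above
STEPS = {0: (0, 1), 1: (1, 1), 2: (1, 0), 3: (1, -1),
         4: (0, -1), 5: (-1, -1), 6: (-1, 0), 7: (-1, 1)}

CHAIN = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 6, 4, 4, 4, 4, 4, 4, 4,
         2, 2, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0,
         6, 6, 0, 0, 0, 0, 0, 0]


def decode_chain(chain, start=(0, 0)):
    """Return the list of cells visited, beginning with the start cell."""
    x, y = start
    cells = [(x, y)]
    for code in chain:
        dx, dy = STEPS[code]
        x, y = x + dx, y + dy
        cells.append((x, y))
    return cells


cells = decode_chain(CHAIN)
print(cells[0] == cells[-1])   # True: the example chain traces a closed boundary
```

The chain stores only one small integer per step, so a boundary is fully described by a start cell plus the code sequence.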

Run Length Encoding
Run length encoding (RLE) counts the length of “runs” of consecutive cells of the same value.
RLE relies on an underlying tile index: different tile indexes lead to different RLEs.
Example:
[18, 11, 5, 11, 5, 11, 5, 11, 5, 10, 6, 10, 6, 10, 8, 8, 8, 8, 8, 8, 8, 10, 6, 10, 6, 10, 6, 10, 18]
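
The dependence of RLE on the tile index can be seen in a small Python sketch. The 3x4 raster below is made up for illustration, and row-prime order is assumed to mean that every second row is traversed in reverse.

```python
from itertools import groupby

RASTER = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]


def run_lengths(values):
    """Lengths of consecutive runs of equal values."""
    return [len(list(run)) for _, run in groupby(values)]


def row_order(raster):
    return [v for row in raster for v in row]


def row_prime_order(raster):
    out = []
    for i, row in enumerate(raster):
        out.extend(row if i % 2 == 0 else reversed(row))
    return out


print(run_lengths(row_order(RASTER)))        # [2, 2, 2, 2, 1, 3]
print(run_lengths(row_prime_order(RASTER)))  # [2, 4, 3, 3]
```

Here the row-prime path yields fewer, longer runs because it does not jump back to the start of each row.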

FCE and RLE
Freeman chain encodings can be combined with run length encoding. E.g.,
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 6, 4, 4, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 6, 6, 0, 0, 0, 0, 0, 0]
becomes
[2, 10, 4, 3, 6, 1, 4, 7, 2, 2, 4, 3, 6, 9, 0, 7, 6, 2, 0, 6]
where each pair gives a direction code followed by its run length.
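
The combination can be checked with a few lines of Python. The function names `rle_pairs` and `expand_pairs` are illustrative; the encoded form is a flat list of (direction code, run length) pairs.

```python
from itertools import groupby

CHAIN_CODE = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 6, 4, 4, 4, 4, 4, 4, 4,
              2, 2, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0,
              6, 6, 0, 0, 0, 0, 0, 0]


def rle_pairs(codes):
    """Flatten runs into [value, count, value, count, ...]."""
    out = []
    for value, run in groupby(codes):
        out += [value, len(list(run))]
    return out


def expand_pairs(pairs):
    """Inverse of rle_pairs."""
    out = []
    for value, count in zip(pairs[0::2], pairs[1::2]):
        out += [value] * count
    return out


ENCODED = rle_pairs(CHAIN_CODE)
print(ENCODED)
# [2, 10, 4, 3, 6, 1, 4, 7, 2, 2, 4, 3, 6, 9, 0, 7, 6, 2, 0, 6]
```

The 50-step chain shrinks to 20 numbers, and `expand_pairs` recovers the original sequence exactly.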

Region Quadtrees
A quadtree is a tree structure in which every non-leaf node has exactly four descendants.
Region quadtrees recursively subdivide non-homogeneous square arrays of cells into four equal-sized quadrants.
Decomposition continues until all squares bound homogeneous regions.

Region Quadtrees

Quadtrees take full advantage of the spatial structure and adapt to variable spatial detail.
They are inefficient for highly inhomogeneous rasters.
They are very sensitive to changes in the embedding space (e.g., translation, rotation).
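
A minimal Python sketch of region quadtree construction, assuming a square binary raster whose side is a power of two. The representation is an assumption chosen for brevity: a leaf is the cell value (0 or 1) and a non-leaf node is a list of four subtrees in the order NW, NE, SW, SE.

```python
def build_quadtree(grid, row=0, col=0, size=None):
    """Return a leaf value, or a list [NW, NE, SW, SE] of subtrees."""
    if size is None:
        size = len(grid)
    first = grid[row][col]
    if all(grid[r][c] == first
           for r in range(row, row + size)
           for c in range(col, col + size)):
        return first                          # homogeneous square: stop subdividing
    half = size // 2
    return [build_quadtree(grid, row, col, half),
            build_quadtree(grid, row, col + half, half),
            build_quadtree(grid, row + half, col, half),
            build_quadtree(grid, row + half, col + half, half)]


def count_leaves(tree):
    return sum(count_leaves(t) for t in tree) if isinstance(tree, list) else 1


GRID = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
TREE = build_quadtree(GRID)
print(TREE, count_leaves(TREE))   # [1, 0, 0, [0, 1, 1, 1]] 7
```

The 16-cell raster is stored in 7 leaves because three quadrants are homogeneous; a checkerboard raster would instead need all 16, which is the inefficiency noted above.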

Quadtree Operations
  o Complement
  o Intersection
  o Union
  o Difference

Quadtree Intersection Algorithm
Input: Binary quadtrees Q, R
  q ← root of Q, r ← root of R
  queue L ← [(q, r)]
  while L is not empty do
    remove the first node pair (x, y) from L
    if x or y is a white leaf then
      add white leaf to output quadtree S
    else if x is a non-white leaf then
      add y and all subnodes to output quadtree S
    else if y is a non-white leaf then
      add x and all subnodes to output quadtree S
    else (x and y are both non-leaf nodes)
      add a new non-leaf node to output quadtree S
      for pairwise descendants x' of x and y' of y do
        add (x', y') to the end of L
Output: A binary quadtree S that represents the intersection Q ∩ R
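
The same case analysis can be written recursively in Python. This is a recursive variant rather than the queue-based form above, and it assumes a compact representation: a leaf is 0 (white) or 1 (black), and a non-leaf node is a list of four subtrees in a fixed quadrant order. The final merge of four white children into a single white leaf is an addition that keeps the output minimal.

```python
def intersect(x, y):
    """Intersection of two binary region quadtrees covering the same square."""
    if x == 0 or y == 0:
        return 0                  # a white leaf on either side gives white
    if x == 1:
        return y                  # x is a black leaf: copy y and its subnodes
    if y == 1:
        return x                  # y is a black leaf: copy x and its subnodes
    children = [intersect(a, b) for a, b in zip(x, y)]
    if all(c == 0 for c in children):
        return 0                  # added step: merge four white quadrants
    return children


Q = [1, 0, 0, [0, 1, 1, 1]]
R = [1, 1, 0, [1, 1, 0, 0]]
print(intersect(Q, R))            # [1, 0, 0, [0, 1, 0, 0]]
```

Work is proportional to the number of node pairs visited, so large homogeneous regions are resolved in a single step without touching individual cells.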

Summary
Physical file organization affects database performance.
Indexes are needed to go beyond the limitations of physical file organization.
Non-spatial indexes, like B-trees, are inadequate for storing spatial data.
The key issue in spatial indexes is representing two-dimensional data in a one-dimensional index.

Grid Structures: Fixed Grid
Partition of a planar region into equal-sized cells.
Points sharing the same cell (bucket) are stored together.
Improves range query performance.
Partition size depends on:
  o Number of points
  o Magnitude of the average range query
Poor performance with non-uniform point distributions.
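
A hedged Python sketch of a fixed grid: points are hashed to buckets by integer cell coordinates, and a range query inspects only the buckets that overlap the query rectangle. The class name `FixedGrid`, the `cell_size` parameter, and the sample points are assumptions for illustration.

```python
from collections import defaultdict


class FixedGrid:
    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.buckets = defaultdict(list)      # (cell_x, cell_y) -> [(x, y, item)]

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y, item):
        self.buckets[self._cell(x, y)].append((x, y, item))

    def range_query(self, x_min, y_min, x_max, y_max):
        cx0, cy0 = self._cell(x_min, y_min)
        cx1, cy1 = self._cell(x_max, y_max)
        hits = []
        for cx in range(cx0, cx1 + 1):        # visit only overlapping buckets
            for cy in range(cy0, cy1 + 1):
                for x, y, item in self.buckets.get((cx, cy), ()):
                    if x_min <= x <= x_max and y_min <= y <= y_max:
                        hits.append(item)
        return hits


GRID_INDEX = FixedGrid(cell_size=25)
for x, y, name in [(37, 43, "Spode Pottery"), (36, 39, "Minton Pottery"),
                   (20, 0, "Trentham Gardens"), (31, 65, "Waterworld")]:
    GRID_INDEX.insert(x, y, name)
print(sorted(GRID_INDEX.range_query(20, 20, 40, 50)))
```

If most points fall into a handful of cells, those buckets grow long and the query degenerates toward a linear scan, which is the non-uniform distribution problem noted above.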

Grid Structures: Grid File
Extends the fixed grid with arbitrary subdivision positions, accounting for the point distribution.

Point Quadtree
Combination of the grid approach with a multidimensional binary search tree.
Each non-leaf node has four descendants.
Each quadrant partition is centered on a data point.
Quadtree build time is O(n log n) and search time is O(log n) on average; both degrade when the insertion order leaves the tree unbalanced.
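
A small Python sketch of point quadtree insertion and exact-match search. The names `PQNode`, `quadrant`, `pq_insert`, and `pq_find` are illustrative, and points lying on a dividing line are assigned to the north or east side by assumption.

```python
class PQNode:
    def __init__(self, x, y, item):
        self.x, self.y, self.item = x, y, item
        self.children = {}                    # "NE", "NW", "SE", "SW" -> PQNode


def quadrant(node, x, y):
    """Quadrant of (x, y) relative to the data point stored at node."""
    return ("N" if y >= node.y else "S") + ("E" if x >= node.x else "W")


def pq_insert(root, x, y, item):
    if root is None:
        return PQNode(x, y, item)
    node = root
    while True:
        q = quadrant(node, x, y)
        if q not in node.children:
            node.children[q] = PQNode(x, y, item)
            return root
        node = node.children[q]


def pq_find(root, x, y):
    node = root
    while node is not None:
        if (node.x, node.y) == (x, y):
            return node.item
        node = node.children.get(quadrant(node, x, y))
    return None


PQ_ROOT = None
for x, y, name in [(37, 43, "Spode Pottery"), (14, 58, "Newcastle Museum"),
                   (36, 39, "Minton Pottery"), (20, 0, "Trentham Gardens")]:
    PQ_ROOT = pq_insert(PQ_ROOT, x, y, name)
print(pq_find(PQ_ROOT, 20, 0))   # Trentham Gardens
```

Each comparison discards three of the four quadrants around the current data point, which gives the logarithmic search time when the tree is reasonably balanced.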

Point Quadtree

2D Tree
A point quadtree leads to an exponential increase in descendants in k dimensions (2^k per node).
A 2D tree is a binary tree that trades tree breadth for depth.
It compares points alternately with respect to each dimension.
Its structure depends on the order of point insertion.
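
The alternating comparison can be sketched in Python as follows. The names `KDNode`, `kd_insert`, `kd_range`, and `kd_height` are illustrative, ties are sent to the right subtree by assumption, and the sample points are a subset of the Potteries sites.

```python
class KDNode:
    def __init__(self, point, item):
        self.point, self.item = point, item
        self.left = None
        self.right = None


def kd_insert(node, point, item, depth=0):
    if node is None:
        return KDNode(point, item)
    axis = depth % 2                          # 0: compare east, 1: compare north
    if point[axis] < node.point[axis]:
        node.left = kd_insert(node.left, point, item, depth + 1)
    else:
        node.right = kd_insert(node.right, point, item, depth + 1)
    return node


def kd_range(node, lo, hi, depth=0):
    """Items whose points lie in the rectangle with corners lo and hi."""
    if node is None:
        return []
    axis = depth % 2
    hits = []
    if lo[0] <= node.point[0] <= hi[0] and lo[1] <= node.point[1] <= hi[1]:
        hits.append(node.item)
    if lo[axis] < node.point[axis]:           # rectangle reaches into the left side
        hits += kd_range(node.left, lo, hi, depth + 1)
    if hi[axis] >= node.point[axis]:          # rectangle reaches into the right side
        hits += kd_range(node.right, lo, hi, depth + 1)
    return hits


def kd_height(node):
    return 0 if node is None else 1 + max(kd_height(node.left), kd_height(node.right))


def kd_build(points):
    root = None
    for point, item in points:
        root = kd_insert(root, point, item)
    return root


POINTS = [((37, 43), "Spode Pottery"), ((14, 58), "Newcastle Museum"),
          ((74, 23), "Gladstone Pottery Museum"), ((36, 39), "Minton Pottery"),
          ((86, 44), "Park Hall Country Park"), ((20, 0), "Trentham Gardens")]
FORWARD = kd_build(POINTS)
BACKWARD = kd_build(list(reversed(POINTS)))
print(sorted(kd_range(FORWARD, (20, 20), (40, 50))))
print(kd_height(FORWARD), kd_height(BACKWARD))
```

Both trees answer the range query identically, but their heights differ, which illustrates the dependence on insertion order.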

2D Tree

PM (PM1) Quadtree
Divides the region into a quadtree such that all edges and vertices are separated into distinct leaf nodes.
Each leaf node contains at most one vertex.
Leaves containing a vertex contain only edges incident with that vertex.
Leaves not containing a vertex contain only one edge.

Rectangles and Minimum Bounding Boxes
Minimum bounding box (MBB/MBR): the smallest rectangle bounding a shape with its axes parallel to the sides of the Cartesian frame.
Using the MBB, some queries may be answered without retrieving the geometry of an object.
E.g., find all objects which lie entirely within a specified region.
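
A brief Python sketch of computing an MBB and using it for the containment example, assuming shapes are given as lists of (x, y) vertices and the query region is itself an axis-parallel rectangle written as (x_min, y_min, x_max, y_max). The function names and the sample polygon are illustrative.

```python
def mbb(points):
    """Minimum bounding box of a vertex list as (x_min, y_min, x_max, y_max)."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def contains(outer, inner):
    """True if rectangle inner lies entirely within rectangle outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


POLYGON = [(2, 1), (5, 3), (4, 6), (1, 4)]
BOX = mbb(POLYGON)
print(BOX)                                # (1, 1, 5, 6)
print(contains((0, 0, 10, 10), BOX))      # True
print(contains((0, 0, 4, 10), BOX))       # False
```

For a rectangular region the test is exact: a shape lies entirely within the region precisely when its MBB does, so the full geometry never needs to be read.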

R-Tree
Multidimensional dynamic spatial data structure similar to the B-tree.
Leaf nodes represent the actual rectangles to be indexed.
Internal nodes represent the smallest axis-parallel rectangle containing all descendants.
Rectangles at any level may overlap.
Good subdivisions:
  o Minimize the total area of containing rectangles
  o Minimize the total area of overlap of containing rectangles
Overlap is critical: point and range searches are inefficient with large overlap (the R+-tree aims to eliminate overlaps).
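
The search side of an R-tree can be sketched in Python on a small hand-built tree. Insertion and node splitting, which make the structure dynamic, are deliberately omitted; the class `RNode`, the helper names, and the sample rectangles are assumptions for illustration. Rectangles are written as (x_min, y_min, x_max, y_max).

```python
def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def bounding(rects):
    """Smallest axis-parallel rectangle containing all given rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


class RNode:
    def __init__(self, entries, leaf):
        self.entries = entries    # leaf: [(rect, item)]; internal: [(rect, child)]
        self.leaf = leaf

    def mbr(self):
        return bounding([rect for rect, _ in self.entries])


def make_internal(children):
    return RNode([(child.mbr(), child) for child in children], leaf=False)


def search(node, query, visited):
    """Items whose rectangles intersect query; visited collects the nodes read."""
    visited.append(node)
    hits = []
    for rect, payload in node.entries:
        if intersects(rect, query):
            hits += [payload] if node.leaf else search(payload, query, visited)
    return hits


LEAF_A = RNode([((0, 0, 2, 2), "a"), ((1, 1, 3, 3), "b")], leaf=True)
LEAF_B = RNode([((6, 6, 8, 8), "c"), ((7, 5, 9, 7), "d")], leaf=True)
ROOT = make_internal([LEAF_A, LEAF_B])

visited = []
print(search(ROOT, (0, 0, 0.5, 0.5), visited), len(visited))   # ['a'] 2
```

The query near the origin reads only the root and one leaf; a query that overlaps both containing rectangles must descend into both subtrees, which is why large overlap hurts search performance.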

R-Tree

R+-Tree

QTM
Spherical tessellations provide a closer approximation to the surface of the Earth.
Octahedral tessellation is the only regular tessellation that can be oriented with vertices at the poles and edges at the equator.
Quaternary triangular mesh (QTM) approximates the surface of the globe.

QTM

QTM

Summary
Point data structures must balance independence from the embedding of points (e.g., grid file) against efficient indexes for inhomogeneous point distributions (e.g., point quadtree).
MBBs provide useful spatial descriptors of a complex spatial object, which can be indexed in place of the object itself.
R-tree and related indexes are amongst the most important spatial indexes in practical GIS.