Multidimensional Data

Many applications of databases are "geographic", i.e. they involve 2-dimensional data. Others involve large numbers of dimensions.
Example: data about sales.
- A sale is described by (store, day, item, color, size, etc.). Sale = point in 5-dim space.
- A customer is described by (age, salary, pcode, marital-status, etc.).

Typical Queries
- Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?"
- Nearest neighbor: "If I am at coordinates (a,b), what is the nearest McDonalds?"
They are expressible in SQL. Do you see how?

SQL
Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?" (Customers is taken to hold the gold-jewelry buyers; sal is in thousands.)
    SELECT COUNT(*) FROM Customers
    WHERE age>=45 AND age<=55 AND sal<100;
Nearest neighbor: "If I am at coordinates (a,b), what is the nearest McDonalds?" Suppose we have a relation Points(x,y,name).
    SELECT * FROM Points p
    WHERE p.name='McDonalds' AND NOT EXISTS (
        SELECT * FROM Points q
        WHERE (q.x-a)*(q.x-a)+(q.y-b)*(q.y-b) < (p.x-a)*(p.x-a)+(p.y-b)*(p.y-b)
          AND q.name='McDonalds'
    );
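The two queries can also be read as brute-force scans. The sketch below is only an illustration in Python: the in-memory lists stand in for the relations, the rows in `points` are made up, and the function names are hypothetical.

```python
# Hypothetical in-memory stand-ins for Customers(age, sal) and Points(x, y, name).
customers = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
             (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
points = [(0, 0, "McDonalds"), (3, 4, "McDonalds"), (1, 1, "Wendys")]  # made-up rows


def count_in_range(rows, age_lo, age_hi, sal_max):
    """Range query: count rows with age_lo <= age <= age_hi and sal < sal_max."""
    return sum(1 for age, sal in rows if age_lo <= age <= age_hi and sal < sal_max)


def nearest(rows, a, b, name):
    """Nearest neighbor: a row called `name` with the smallest squared distance to (a, b)."""
    candidates = [r for r in rows if r[2] == name]
    return min(candidates, key=lambda r: (r[0] - a) ** 2 + (r[1] - b) ** 2, default=None)


range_count = count_in_range(customers, 45, 55, 100)
closest = nearest(points, 2, 3, "McDonalds")
```

Both functions touch every row, which is exactly the cost that the index structures in the following slides try to avoid. Unlike the NOT EXISTS query, `nearest` returns a single row when several points are equally close.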

Big Impediment
For these types of queries, there is no clean way to eliminate lots of records that don't meet the condition of the WHERE-clause.

An Approach for range queries
- Index on attributes independently.
- Intersect pointers in main memory to save disk I/O.

Attempt at using B-trees for MD-queries
- Database = 1,000,000 points evenly distributed in a 1000×1000 square.
- Stored in 10,000 blocks (100 recs per block).
- B-tree secondary indexes on x and on y.
- Range query {(x,y) : 450 ≤ x ≤ 550, 450 ≤ y ≤ 550}.
- 100,000 pointers (i.e. 1,000,000/10) for the x range, and same for y.
- 10,000 pointers for answer (found by pointer intersection).
- Retrieve 10,000 records. If they are stored randomly we need to do 10,000 I/O's.
Add here the cost of B-Trees:
- Root of each B-tree in main memory.
- Suppose leaves have avg. 200 keys → 500 disk I/O's in each B-tree to get pointer lists → 1,000 + 2 (for intermediate B-tree level) disk I/O's.
- Total 11,002 disk I/O's, more than sequential scan of file = 10,000 I/O's.
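The arithmetic on this slide can be replayed step by step. The sketch below only restates the slide's assumptions (uniform distribution, 200 keys per leaf, one intermediate level, roots in memory); the variable names are illustrative.

```python
# Assumptions taken from the slide; variable names are illustrative.
num_points = 1_000_000
side = 1000             # the square is side x side
recs_per_block = 100
keys_per_leaf = 200
range_width = 100       # 450..550 in each dimension

data_blocks = num_points // recs_per_block                            # sequential scan cost
pointers_per_dim = num_points * range_width // side                   # 1/10 of all points
answer_points = num_points * range_width * range_width // (side * side)

leaf_ios_per_tree = pointers_per_dim // keys_per_leaf                 # leaves read per B-tree
index_ios = 2 * (leaf_ios_per_tree + 1)                               # +1 intermediate node per tree
total_ios = answer_points + index_ios                                 # one I/O per randomly placed record
```

With these figures `total_ios` exceeds `data_blocks`, which is the slide's point: two independent B-trees do worse than a sequential scan here.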

Nearest Neighbor query using B-trees
Turn NN to (10,20) into a range query {(x,y) : 10-d ≤ x ≤ 10+d, 20-d ≤ y ≤ 20+d}.
Possible problems:
1. No point in the selected range.
2. The closest point inside may not be the answer (a point just outside the square, opposite the middle of a side, can be closer than a point inside near a corner).
Solution: re-execute range query with slightly larger d.
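A minimal sketch of this retry loop follows. It is not the slide's exact procedure: the slide only says to use a "slightly larger d", whereas this version doubles d when the square is empty and otherwise grows d to the distance of the best candidate, which is one way to guarantee that the next answer is final. `range_query` is a brute-force stand-in for the two B-tree lookups plus pointer intersection.

```python
import math


def range_query(points, x_lo, x_hi, y_lo, y_hi):
    """Stand-in for the index-based range query (brute force here)."""
    return [(x, y) for x, y in points if x_lo <= x <= x_hi and y_lo <= y <= y_hi]


def nearest_neighbor(points, a, b, d=1.0):
    """Find the point nearest to (a, b) by repeated square range queries."""
    if not points:
        return None
    while True:
        hits = range_query(points, a - d, a + d, b - d, b + d)
        if not hits:
            d *= 2  # problem 1: empty square, widen it (growth factor is an assumption)
            continue
        best = min(hits, key=lambda p: math.dist(p, (a, b)))
        r = math.dist(best, (a, b))
        if r <= d:
            return best  # every point outside the square is farther than d >= r
        d = r  # problem 2: a closer point may lie outside; cover the circle of radius r


sample = [(12, 22), (10, 22.5), (40, 40)]  # made-up points
answer = nearest_neighbor(sample, 10, 20, d=1.0)
```

In the sample run the square with d = 2 contains only (12, 22), yet the true nearest neighbor (10, 22.5) lies just outside it, so one more range query is needed.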

NN-queries, example
- Same relation Points and its indexes on x and y as before. Query: NN to (10,20).
- Choose d = 1 → range query = {(x,y) : 9 ≤ x ≤ 11, 19 ≤ y ≤ 21}.
- 2000 points in [9,11], 2000 points in [19,21] → for each dimension, we pay 10+1 I/O's to get pointers from the B-tree leaves. The +1 is because points with x=9 may not start just at the beginning of the leaf.
- Add an extra I/O for the intermediate node when finding the start of the range for each index.
- Total: 2 × (10+1+1) = 24 disk I/O's for the two indexes, plus 1 for the data block, i.e. 25 disk I/O's to get the answer, assuming 1 of the 4 points (on average 4 points fall in both ranges) is the answer, which we can determine by their coordinates, prior to getting the data blocks holding the points.
- However, if d is too small, we have to run another range query with a larger d.
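The cost of this example can be checked with a few lines of arithmetic. The sketch reuses the earlier assumptions (1,000,000 uniformly distributed points in a 1000×1000 square, 200 keys per leaf); the final single I/O for the data block holding the answer is my reading of the slide rather than something it spells out.

```python
# Assumptions carried over from the earlier B-tree example; names are illustrative.
num_points = 1_000_000
side = 1000
keys_per_leaf = 200
d = 1

strip_width = 2 * d                                               # [9,11] or [19,21]
points_per_strip = num_points * strip_width // side               # points in each range
leaf_ios = points_per_strip // keys_per_leaf + 1                  # +1: range may start mid-leaf
ios_per_index = leaf_ios + 1                                      # +1: intermediate node
candidates = num_points * strip_width * strip_width // (side * side)

index_ios = 2 * ios_per_index
total_ios = index_ios + 1                                         # assumed: one block holds the answer
```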

Grid files (hash-like structure)
- Divide data into stripes in each dimension.
- Each rectangle is a bucket.
Example: database records (age, salary) for people who buy gold jewelry.
Data: (25,60) (45,60) (50,75) (50,100) (50,120) (70,110) (85,140) (30,260) (25,400) (45,350) (50,275) (60,260)

Grid file

Operations
Lookup
- Find coordinates of point in each dimension; this gives you a bucket to search.
Nearest Neighbor
- Lookup point P. Consider points in its bucket.
- Problem: there could be points in adjacent buckets that are closer.
- Problem: there could be no points at all in the bucket: widen search?
Range Queries
- Ranges define a region of buckets.
- Buckets on border may contain points not in range.
- Example: 35 < age <= 45; 50 < salary <= 100.
Queries Specifying Only One Attribute
- Must search a whole row or column of buckets.
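Lookup and range queries on a grid file can be sketched as follows. This is a toy in-memory version: the stripe boundaries (age 40 and 55, salary 90 and 225) are assumptions chosen for illustration, buckets are plain Python lists with no overflow handling, and the class and method names are hypothetical.

```python
from bisect import bisect_right


class GridFile:
    """Toy 2-D grid file: stripe boundaries per dimension, one bucket per cell."""

    def __init__(self, x_bounds, y_bounds):
        self.x_bounds = x_bounds  # sorted split values in dimension x
        self.y_bounds = y_bounds  # sorted split values in dimension y
        self.buckets = {}         # (i, j) -> list of points

    def _cell(self, x, y):
        # Stripe i holds values v with bounds[i-1] <= v < bounds[i].
        return bisect_right(self.x_bounds, x), bisect_right(self.y_bounds, y)

    def insert(self, x, y):
        self.buckets.setdefault(self._cell(x, y), []).append((x, y))

    def lookup(self, x, y):
        return (x, y) in self.buckets.get(self._cell(x, y), [])

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        """Points with x_lo < x <= x_hi and y_lo < y <= y_hi."""
        i_lo, j_lo = self._cell(x_lo, y_lo)
        i_hi, j_hi = self._cell(x_hi, y_hi)
        out = []
        for i in range(i_lo, i_hi + 1):
            for j in range(j_lo, j_hi + 1):
                # Border buckets may hold points outside the range, so filter.
                for x, y in self.buckets.get((i, j), []):
                    if x_lo < x <= x_hi and y_lo < y <= y_hi:
                        out.append((x, y))
        return out


g = GridFile(x_bounds=[40, 55], y_bounds=[90, 225])  # assumed stripe boundaries
for p in [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]:
    g.insert(*p)
in_range = g.range_query(35, 45, 50, 100)
```

A query that fixes only one attribute corresponds to calling `range_query` with an unbounded interval in the other dimension, which visits a whole row or column of buckets.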

Insertion
- Use overflow buckets, or split stripes in one or more dimensions.
- Example: insert (52,200).

Insertion
- Insert (52,200). Split central bucket, for instance by splitting central salary stripe (one possibility).
- Splitting a stripe affects every bucket in it: here a block of 3 buckets has to be processed. In general, a block of n buckets has to be processed during a split.

Grid files
Advantages
- Good for multiple-key search.
- Supports partial match, range queries, NN queries.
Disadvantages
- Space management overhead.
- Need partitioning ranges that evenly split keys.
- Possibility of overflow buckets for insertion.

Partitioned hashing I
- If we hash the concatenation of several keys, then the hash table cannot be used for queries specifying only one dimension (key).
- Instead, create the hash function h as a concatenation of n hash functions, one for each dimensional attribute: h = (h_1, …, h_n).
- The bucket in which to put a tuple (v_1, …, v_n) is computed by concatenating the bit sequences h_1(v_1) … h_n(v_n).
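The concatenation of bit sequences can be sketched with shifts. The function name, the per-attribute bit widths, and the two sample hash functions (one bit from age mod 2, two bits from salary mod 4) are assumptions made for this illustration.

```python
def partitioned_hash(values, hash_fns, bit_widths):
    """Concatenate the bits h_1(v_1) ... h_n(v_n) into one bucket number."""
    bucket = 0
    for v, h, bits in zip(values, hash_fns, bit_widths):
        mask = (1 << bits) - 1                      # keep only `bits` low-order bits
        bucket = (bucket << bits) | (h(v) & mask)
    return bucket


def h_age(age):
    return age % 2       # contributes 1 bit


def h_salary(salary):
    return salary % 4    # contributes 2 bits


bucket = partitioned_hash((50, 75), (h_age, h_salary), (1, 2))
bucket_bits = format(bucket, "03b")
```

Because each attribute owns a fixed group of bits, knowing one attribute pins down its bits and leaves the others free, which is what makes partial match queries possible.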

Partitioned hashing II
Example: gold jewelry with
- first bit: age mod 2
- bits 2 and 3: salary mod 4
Partial match? Range? NN?

Partitioned hashing III
Partial match query
- Specifying only the value of age: compute h_age(a), which could be, say, 1. Then locate all the relevant buckets, which are those from 100 to 111.
- Specifying only the value of salary: compute h_salary(s), which could be, say, 10. Then locate the relevant buckets, which are 010 and 110.
Bad for (hashing does not keep nearby values in nearby buckets):
- range queries
- nearest neighbor queries
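Enumerating the buckets for a partial match query amounts to fixing the known bits and trying every combination of the unknown ones. The helper below is a hypothetical sketch; `known` holds the hash value of each specified attribute and `None` for an unspecified one.

```python
from itertools import product


def matching_buckets(known, bit_widths):
    """Return the bit strings of all buckets compatible with the known hash values."""
    choices = []
    for h, bits in zip(known, bit_widths):
        if h is None:
            # Unspecified attribute: every bit pattern of this width is possible.
            choices.append([format(i, f"0{bits}b") for i in range(1 << bits)])
        else:
            choices.append([format(h, f"0{bits}b")])
    return ["".join(parts) for parts in product(*choices)]


age_only = matching_buckets([1, None], [1, 2])         # h_age(a) = 1
salary_only = matching_buckets([None, 0b10], [1, 2])   # h_salary(s) = 10 in binary
```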

Grid files vs. partitioned hashing
- If many dimensions → many empty cells in grid, while partitioned hashing is OK.
- Both support exact and partial match queries.
- Grid files are good for range and NN queries, while partitioned hashing is not suited to them at all.

Multiple-key indexes
- Index on one attribute provides pointer to an index on the other.
- Let V be a value of the first attribute. Then the index we reach by following the pointer for V is an index into the set of points that have V as their value in the first attribute and any value for the second attribute.

Example
"Who buys gold jewelry" (age and salary only). Raw data in age-salary pairs: (25; 60) (45; 60) (50; 75) (50; 100) (50; 120) (70; 110) (85; 140) (30; 260) (25; 400) (45; 350) (50; 275) (60; 260)
Question: For what kinds of queries will a multiple-key index (age first) significantly reduce the number of disk I/O's?
The indexes can be organized as B-Trees.

Operations
Partial match queries
- If the first attribute is specified, then the access is quite efficient.
- If the first attribute isn't specified, then we have to search every sub-index.
Range queries
- Work quite well, provided the individual indexes themselves support range queries on their attribute (e.g. they are B-Trees).
- Example: range query is 35 ≤ age ≤ 55 AND 100 ≤ sal ≤ 200.
- Also, the indexes should be "primary" ones if we want to support range queries efficiently.
NN queries
- Similar to range queries.
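These operations can be sketched with sorted lists standing in for the B-trees. This is an in-memory illustration under that assumption, with hypothetical class and method names; `bisect` plays the role of the ordered search that a B-tree would provide.

```python
from bisect import bisect_left, bisect_right


class MultiKeyIndex:
    """Index on age whose entries point to per-age indexes on salary."""

    def __init__(self, points):
        by_age = {}
        for age, sal in points:
            by_age.setdefault(age, []).append(sal)
        self.ages = sorted(by_age)                               # outer index
        self.sub = {a: sorted(s) for a, s in by_age.items()}     # inner indexes

    def range_query(self, age_lo, age_hi, sal_lo, sal_hi):
        """Points with age_lo <= age <= age_hi and sal_lo <= sal <= sal_hi."""
        out = []
        lo, hi = bisect_left(self.ages, age_lo), bisect_right(self.ages, age_hi)
        for age in self.ages[lo:hi]:          # only the sub-indexes in the age range
            sals = self.sub[age]
            for sal in sals[bisect_left(sals, sal_lo):bisect_right(sals, sal_hi)]:
                out.append((age, sal))
        return out

    def match_salary(self, sal):
        """Partial match on the second attribute: every sub-index is probed."""
        out = []
        for age in self.ages:
            sals = self.sub[age]
            i = bisect_left(sals, sal)
            if i < len(sals) and sals[i] == sal:
                out.append((age, sal))
        return out


idx = MultiKeyIndex([(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
                     (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)])
in_range = idx.range_query(35, 55, 100, 200)
by_salary = idx.match_salary(260)
```

`range_query` opens only the sub-indexes for ages 45 and 50, while `match_salary` has to visit all of them, mirroring the asymmetry between the first and second attribute.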

KD-Trees
- Levels rotate among the dimensions, partitioning the points by comparison with a value for that dimension.
- Leaves are blocks holding the data records.

Geometrically… Remember we didn’t want the stripes in grid files to continue all along the vertical or horizontal direction. Here they don’t.

Operations
Lookup in KD-Trees
- Find appropriate leaf by binary search. Is the record there?
Insert Into KD-Trees
- Lookup record to be inserted, reaching the appropriate leaf.
- If there is room, put record in that block.
- If not, find a suitable value for the appropriate dimension and split the leaf block using that dimension.
Example
- Someone 35 years old with a salary of $500K buys gold jewelry.
- Belongs in leaf with (25; 400) and (45; 350).
- Too full: split on age. See figure next.
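Lookup and insertion can be sketched as below. Several details are assumptions of this sketch rather than statements from the slides: leaf blocks hold at most two records, the split value is the median of the overfull leaf, points smaller than the split value go left, and the split dimension is chosen from the depth (0 = age, 1 = salary).

```python
CAPACITY = 2  # assumed number of records per leaf block


class Leaf:
    def __init__(self, points=None):
        self.points = points or []


class Node:
    def __init__(self, dim, value, left, right):
        self.dim, self.value, self.left, self.right = dim, value, left, right


def lookup(node, point):
    """Walk down by comparing with each node's split value, then scan the leaf."""
    while isinstance(node, Node):
        node = node.left if point[node.dim] < node.value else node.right
    return point in node.points


def insert(node, point, depth=0):
    """Insert `point` and return the (possibly new) root of this subtree."""
    if isinstance(node, Node):
        if point[node.dim] < node.value:
            node.left = insert(node.left, point, depth + 1)
        else:
            node.right = insert(node.right, point, depth + 1)
        return node
    node.points.append(point)
    if len(node.points) <= CAPACITY:
        return node
    dim = depth % 2                                  # dimensions take turns by level
    pts = sorted(node.points, key=lambda p: p[dim])
    value = pts[len(pts) // 2][dim]                  # median as split value
    left = [p for p in pts if p[dim] < value]
    right = [p for p in pts if p[dim] >= value]
    if not left:
        return node  # all values equal on this dimension; keep an overfull leaf
    return Node(dim, value, Leaf(left), Leaf(right))


root = Leaf()
for p in [(25, 400), (45, 350), (35, 500)]:
    root = insert(root, p)
```

In this run the third insertion overflows the leaf holding (25, 400) and (45, 350), and the leaf is split on age at 35, as in the slide's example (here the leaf happens to be the root, so it is age's turn).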

Someone 35 years old with a salary of $500K buys gold jewelry.
It is the turn of "age" to be used for the split. Split at 35; it is the median.

Queries
Partial match queries
- When we don't know the value of the attribute at the node, we must explore both of its children.
- E.g. find points with age=50.
Range Queries
- Sometimes a range will allow us to move to only one child of a node.
- But if the range straddles the splitting value then we must explore both children.
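Both query types reduce to the same traversal rule: descend into a child only if the query interval can overlap it. The tree below is hand-built over the gold-jewelry data as an assumption for illustration (interior nodes are `(dim, value, left, right)` tuples with dim 0 = age and 1 = salary, leaves are lists, and points below the split value go left); it is not claimed to match the slide's figure.

```python
import math


def range_search(node, lo, hi):
    """Return points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if isinstance(node, list):  # leaf block
        return [p for p in node
                if all(lo[d] <= p[d] <= hi[d] for d in range(len(lo)))]
    dim, value, left, right = node
    out = []
    if lo[dim] < value:     # the range reaches below the split value
        out += range_search(left, lo, hi)
    if hi[dim] >= value:    # the range reaches the split value or above
        out += range_search(right, lo, hi)
    return out


# Hand-built example tree (an assumption of this sketch).
tree = (1, 150,
        (0, 60,
         (1, 80, [(25, 60), (45, 60), (50, 75)], [(50, 100), (50, 120)]),
         [(70, 110), (85, 140)]),
        (0, 47,
         (1, 300, [(30, 260)], [(25, 400), (45, 350)]),
         [(50, 275), (60, 260)]))

# Partial match: age = 50, salary unknown, expressed as an unbounded salary range.
age_50 = range_search(tree, (50, -math.inf), (50, math.inf))
# Range query: 35 <= age <= 55 and 100 <= salary <= 200.
in_range = range_search(tree, (35, 100), (55, 200))
```

For the partial match, every salary node forces a visit to both children, whereas age nodes send the search to one side only.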

KD-trees in secondary storage
If internal nodes don't fit in main memory, group them into blocks.

Quad trees
- Nodes split at all dimensions at once.
- For a quad tree of k dimensions, each interior node has 2^k children.
[Figure: points a to l plotted in the Age/Sal plane, and the corresponding quad tree with interior nodes at (Age 50, Sal 200), (Age 25, Sal 300) and (Age 75, Sal 100).]

Why quad trees?
- k dimensions → node has 2^k children, e.g. k=7 → 128 children.
- If 128, or 2^7, pointers can fit in a block, then k=7 is a convenient number of dimensions.

Quad-Tree Insert and Queries
Insert
- Find leaf node in which new point belongs.
- If room, put it there.
- If not, make the leaf an interior node and give it leaves for each quadrant. Split the points among the new leaves.
- Problem: may make lots of null pointers, especially in high dimensions.
Quad-Tree Queries
- Single point queries: easy; just go down the tree to proper leaf.
- Range queries: varies by position of range.
- Example: a range like 45<age<55; 180<salary<220 requires search of four leaves.
- Nearest neighbor: problems and strategies similar to grid files.
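Insertion and range queries for the 2-dimensional case can be sketched as follows. The assumptions here are mine: each node covers a rectangle and splits at its center, a leaf holds at most two points, the root covers age 0 to 100 and salary 0 to 400, and a leaf made up only of identical points is left overfull instead of being split forever.

```python
class QuadTree:
    """Toy region quad tree for 2-D points; splits a full leaf into 4 quadrants."""

    def __init__(self, x0, y0, x1, y1, capacity=2):
        self.box = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []       # used while this node is a leaf
        self.children = None   # four QuadTree nodes once split

    def _center(self):
        x0, y0, x1, y1 = self.box
        return (x0 + x1) / 2, (y0 + y1) / 2

    def _child_for(self, x, y):
        cx, cy = self._center()
        return self.children[(1 if x >= cx else 0) + (2 if y >= cy else 0)]

    def insert(self, x, y):
        if self.children is not None:
            self._child_for(x, y).insert(x, y)
            return
        self.points.append((x, y))
        # Split only if the points are not all identical (avoids endless splitting).
        if len(self.points) > self.capacity and len(set(self.points)) > 1:
            self._split()

    def _split(self):
        x0, y0, x1, y1 = self.box
        cx, cy = self._center()
        self.children = [
            QuadTree(x0, y0, cx, cy, self.capacity),  # low x, low y
            QuadTree(cx, y0, x1, cy, self.capacity),  # high x, low y
            QuadTree(x0, cy, cx, y1, self.capacity),  # low x, high y
            QuadTree(cx, cy, x1, y1, self.capacity),  # high x, high y
        ]
        pts, self.points = self.points, []
        for x, y in pts:
            self._child_for(x, y).insert(x, y)

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        """Points with x_lo <= x <= x_hi and y_lo <= y <= y_hi."""
        x0, y0, x1, y1 = self.box
        if x_hi < x0 or x_lo > x1 or y_hi < y0 or y_lo > y1:
            return []  # this quadrant cannot contain matches
        if self.children is None:
            return [(x, y) for x, y in self.points
                    if x_lo <= x <= x_hi and y_lo <= y <= y_hi]
        out = []
        for child in self.children:
            out += child.range_query(x_lo, x_hi, y_lo, y_hi)
        return out


qt = QuadTree(0, 0, 100, 400)  # assumed bounds: age 0..100, salary 0..400
for p in [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]:
    qt.insert(*p)
found = qt.range_query(45, 55, 100, 300)
```

With these bounds the root splits at (age 50, salary 200). A range that straddles a center line, such as the one queried here, has to descend into several quadrants.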