

Multidimensional Data
Many applications of databases are "geographic", i.e. they involve 2-dimensional data. Others involve large numbers of dimensions.
Example: data about sales.
- A sale is described by (store, day, item, color, size, etc.). Sale = point in 5-dim space.
- A customer is described by (age, salary, pcode, marital-status, etc.).
Typical Queries
- Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?"
- Nearest neighbor: "If I am at coordinates (x,y), what is the nearest McDonalds?"
Both are expressible in SQL. Do you see how?

SQL
Range query: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?" (assuming sal is stored in thousands)
    SELECT COUNT(*) FROM Customers
    WHERE age >= 45 AND age <= 55 AND sal < 100;
Nearest neighbor: "If I am at coordinates (a,b), what is the nearest McDonalds?" Suppose we have a relation Points(x,y,name).
    SELECT * FROM Points p
    WHERE p.name = 'McDonalds' AND NOT EXISTS (
      SELECT * FROM Points q
      WHERE (q.x-a)*(q.x-a) + (q.y-b)*(q.y-b) < (p.x-a)*(p.x-a) + (p.y-b)*(p.y-b)
        AND q.name = 'McDonalds' );
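The nearest-neighbor query can be tried out directly. Below is a minimal sketch using Python's built-in sqlite3 module; the sample rows, the function name nearest_mcdonalds, and the named parameters :a and :b (standing for the query coordinates a and b) are illustrative assumptions, not part of the slides.

```python
import sqlite3

# In-memory database with the Points(x, y, name) relation from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Points (x REAL, y REAL, name TEXT)")
conn.executemany(
    "INSERT INTO Points VALUES (?, ?, ?)",
    [
        (0, 0, "McDonalds"),
        (3, 4, "McDonalds"),
        (1, 1, "Wendys"),      # hypothetical row: close by, but the wrong name
        (10, 10, "McDonalds"),
    ],
)

# p is the answer if no other McDonalds q is strictly closer to (a, b).
NEAREST = """
SELECT p.x, p.y FROM Points p
WHERE p.name = 'McDonalds'
  AND NOT EXISTS (
    SELECT * FROM Points q
    WHERE q.name = 'McDonalds'
      AND (q.x - :a)*(q.x - :a) + (q.y - :b)*(q.y - :b)
        < (p.x - :a)*(p.x - :a) + (p.y - :b)*(p.y - :b)
  )
"""

def nearest_mcdonalds(a, b):
    """Return the (x, y) rows of the McDonalds closest to (a, b)."""
    return conn.execute(NEAREST, {"a": a, "b": b}).fetchall()

print(nearest_mcdonalds(2, 3))
```

Without a suitable index, the NOT EXISTS subquery compares every candidate against every other one, which is exactly the cost problem the following slides address.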

Big Impediment
For these types of queries, there is no clean way to eliminate lots of records that don't meet the condition of the WHERE-clause.
An Approach
Index on attributes independently.
- Intersect pointers in main memory to save disk I/O.
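The pointer-intersection idea can be sketched as follows. This is only an illustration: sorted Python lists stand in for the leaves of two one-dimensional indexes, list positions stand in for disk pointers, and the sample (age, salary) records and all function names are assumptions.

```python
from bisect import bisect_left, bisect_right

# (age, salary in thousands); the position in the list plays the role of a disk pointer.
RECORDS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
           (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

def build_index(records, dim):
    """Sorted (key, pointer) pairs, standing in for the leaf level of a B-tree."""
    return sorted((rec[dim], rid) for rid, rec in enumerate(records))

def pointers_in_range(index, lo, hi):
    """Pointers of all entries whose key lies in the closed range [lo, hi]."""
    start = bisect_left(index, (lo, -1))
    end = bisect_right(index, (hi, len(index)))
    return {rid for _, rid in index[start:end]}

age_index = build_index(RECORDS, 0)
salary_index = build_index(RECORDS, 1)

def range_query(age_lo, age_hi, sal_lo, sal_hi):
    """Intersect the two pointer sets in memory, then fetch only the survivors."""
    hits = (pointers_in_range(age_index, age_lo, age_hi)
            & pointers_in_range(salary_index, sal_lo, sal_hi))
    return sorted(RECORDS[rid] for rid in hits)

print(range_query(45, 55, 0, 99))
```

Only the records whose pointers appear in both sets are fetched, which is where the I/O saving comes from.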

Attempt at using B-trees for MD-queries
- Database = 1,000,000 points evenly distributed in a 1000×1000 square, stored in 10,000 blocks (100 records per block).
- B-tree indexes on x and on y.
- Range query {(x,y) : 450 ≤ x ≤ 550, 450 ≤ y ≤ 550}.
- 100,000 pointers (i.e. 1,000,000/10) for the x range, and the same for y.
- 10,000 pointers for the answer (found by pointer intersection).
- Retrieve 10,000 records. If they are stored randomly we need 10,000 I/Os.
Add the cost of the B-trees:
- The root of each B-tree is in main memory.
- Suppose leaves hold 200 keys on average ⇒ 500 disk I/Os in each B-tree to get the pointer lists ⇒ 1,000 + 2 (for the intermediate B-tree level) disk I/Os.
Total: 11,002 disk I/Os, more than a sequential scan of the file (10,000 I/Os).
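The arithmetic on this slide can be reproduced with a short calculation. The function and parameter names below are illustrative; the model (uniform data, roots in memory, one intermediate node per B-tree, one I/O per retrieved record) is the one stated on the slide.

```python
def btree_range_query_io(n_records, side, lo, hi, keys_per_leaf):
    """Estimate disk I/Os for a square range query answered with two B-trees.

    Assumes points are uniform in a side x side square, each B-tree root is in
    main memory, and each B-tree costs one I/O for an intermediate node.
    """
    width = hi - lo
    pointers_per_index = n_records * width // side           # 100,000
    leaf_ios_per_index = pointers_per_index // keys_per_leaf  # 500
    index_ios = 2 * leaf_ios_per_index + 2                    # 1,000 + 2
    answer_records = n_records * width * width // (side * side)  # 10,000
    data_ios = answer_records      # worst case: every record in a different block
    return index_ios, data_ios, index_ios + data_ios

index_ios, data_ios, total = btree_range_query_io(
    n_records=1_000_000, side=1000, lo=450, hi=550, keys_per_leaf=200)
sequential_scan = 1_000_000 // 100   # 10,000 blocks of 100 records each
print(index_ios, data_ios, total, sequential_scan)
```

Under these assumptions the indexed plan costs more than simply scanning the file, which motivates the specialized structures that follow.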

Nearest Neighbor query using B-trees
Turn NN to (10,20) into a range query {(x,y) : 10-d ≤ x ≤ 10+d, 20-d ≤ y ≤ 20+d}.
Possible problems:
- There may be no point in the selected range.
- The closest point inside the range may not be the answer (a point just outside the square can be closer than a point near one of its corners).
Solution: re-execute the range query with a slightly larger d.
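One way to sketch the "re-execute with a larger d" loop is shown below. The range query is a plain list scan standing in for the two-index lookup, and the growth policy (double d when the square is empty, otherwise grow d to the best distance found) is an assumption chosen for illustration, not something the slides prescribe.

```python
import math

def range_query(points, cx, cy, d):
    """All points in the square of half-width d centred on (cx, cy)."""
    return [(x, y) for (x, y) in points
            if abs(x - cx) <= d and abs(y - cy) <= d]

def nearest_neighbor(points, cx, cy, d=1.0):
    """Nearest neighbor via repeated range queries with a growing d."""
    if not points:
        return None
    while True:
        candidates = range_query(points, cx, cy, d)
        if not candidates:
            d *= 2            # problem 1: empty range, widen the search
            continue
        best = min(candidates, key=lambda p: math.dist(p, (cx, cy)))
        dist = math.dist(best, (cx, cy))
        if dist <= d:
            # Every point outside the square is farther than d, so best is final.
            return best
        d = dist              # problem 2: a closer point may lie outside the square

points = [(11, 21), (10, 21.2), (30, 40)]
print(nearest_neighbor(points, 10, 20))
```

In the example, the first query with d = 1 finds only (11, 21) at distance about 1.41, so the query is repeated with the larger d and then finds the true answer (10, 21.2).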

NN-queries, example
Same relation Points and its indexes on x and y as before. Query: NN to (10,20).
- Choose d = 1 ⇒ range query = {(x,y) : 9 ≤ x ≤ 11, 19 ≤ y ≤ 21}.
- 2000 points in [9,11], same in [19,21] ⇒ each dimension needs 10+1 I/Os to get pointers (the +1 is because points with x=9 may not start exactly at the beginning of a leaf).
- With an extra I/O for the intermediate node of each index ⇒ 24 disk I/Os for the pointers, plus 1 to fetch the answer = 25 disk I/Os, assuming 1 of the 4 points is the answer, which we can determine from their coordinates before fetching the data blocks holding the points.
- However, if d is too small, we have to run another range query with a larger d.
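The counts in this example can be checked with the same cost model as the earlier range query (1,000,000 uniform points in a 1000×1000 square, 200 keys per leaf). The function below is a sketch; its name and the final "+1 I/O for the answer block" are our reading of the slide.

```python
def nn_range_query_io(n_records, side, d, keys_per_leaf):
    """Estimate I/Os for the range query that approximates an NN query."""
    width = 2 * d
    pointers_per_index = n_records * width // side            # 2000
    # +1 leaf because the range may not start at a leaf boundary.
    leaf_ios_per_index = pointers_per_index // keys_per_leaf + 1   # 10 + 1
    index_ios = 2 * (leaf_ios_per_index + 1)   # plus one intermediate node each
    expected_candidates = n_records * width * width // (side * side)
    total_ios = index_ios + 1                  # fetch the single answer block
    return index_ios, expected_candidates, total_ios

print(nn_range_query_io(n_records=1_000_000, side=1000, d=1, keys_per_leaf=200))
```

The expected number of candidate points in the 2×2 square is 4, which is why the answer can usually be picked from the pointers' coordinates before any data block is read.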

Grid files (hash-like structure)
- Divide the data into stripes in each dimension.
- Each rectangle in the grid points to a bucket.
Example: database records (age,salary) for people who buy gold jewelry.
Data: (25,60) (45,60) (50,75) (50,100) (50,120) (70,110) (85,140) (30,260) (25,400) (45,350) (50,275) (60,260)

Grid file

Operations
Lookup
- Find the coordinates of the point in each dimension; this gives you a bucket to search.
Nearest Neighbor
- Look up point P and consider the points in that bucket.
- Problem: there could be points in adjacent buckets that are closer.
- Problem: there could be no points at all in the bucket: widen search?
Range Queries
- Ranges define a region of buckets.
- Buckets on the border may contain points not in the range.
- Example: 35 < age <= 45; 50 < salary <= 100.
Queries Specifying Only One Attribute
- Problem: must search a whole row or column of buckets.
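Lookup and range queries on a grid file can be sketched as below. The stripe boundaries (age at 40 and 55, salary at 90 and 225), the rule that a boundary value belongs to the higher stripe, and the class and method names are all assumptions made for illustration; the slides do not state them.

```python
from bisect import bisect_right

DATA = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
        (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

class GridFile:
    def __init__(self, age_bounds, salary_bounds):
        # Hypothetical stripe boundaries, one sorted list per dimension.
        self.bounds = (list(age_bounds), list(salary_bounds))
        self.buckets = {}          # grid cell -> list of points

    def cell(self, point):
        """Stripe index in each dimension; a boundary value goes to the higher stripe."""
        return tuple(bisect_right(b, v) for b, v in zip(self.bounds, point))

    def insert(self, point):
        self.buckets.setdefault(self.cell(point), []).append(point)

    def lookup(self, point):
        """Search only the one bucket the point's coordinates map to."""
        return point in self.buckets.get(self.cell(point), [])

    def range_query(self, lo, hi):
        """Points p with lo[i] < p[i] <= hi[i]; also reports buckets visited."""
        first, last = self.cell(lo), self.cell(hi)
        result, visited = [], 0
        for i in range(first[0], last[0] + 1):
            for j in range(first[1], last[1] + 1):
                visited += 1
                # Border buckets may hold points outside the range: filter them.
                for p in self.buckets.get((i, j), []):
                    if all(l < v <= h for l, v, h in zip(lo, p, hi)):
                        result.append(p)
        return result, visited

grid = GridFile(age_bounds=[40, 55], salary_bounds=[90, 225])
for p in DATA:
    grid.insert(p)
print(grid.range_query(lo=(35, 50), hi=(45, 100)))
```

For the slide's example range (35 < age <= 45; 50 < salary <= 100), four buckets are visited under these assumed boundaries but only one point qualifies, which illustrates the border-bucket remark.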

Insertion
- Use overflow buckets, or split stripes in one or more dimensions.
- Insert (52,200): split the central bucket, for instance by splitting the central salary stripe.
- The blocks of 3 buckets must be processed. In general, a split requires processing the blocks of n buckets, where n is the number of buckets along the stripe being split.
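Splitting a stripe can be sketched as follows. The initial boundaries (age at 40 and 55, salary at 90 and 225), the new salary boundary at 130, and the function names are assumptions for illustration; the slide only says that the central salary stripe is split.

```python
from bisect import bisect_right, insort

DATA = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
        (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

def cell_of(bounds, point):
    """Stripe index per dimension; a boundary value goes to the higher stripe."""
    return tuple(bisect_right(b, v) for b, v in zip(bounds, point))

def split_stripe(bounds, buckets, dim, new_bound):
    """Split one stripe of dimension dim at new_bound (modifies bounds in place).

    Returns the new cell -> points mapping and the number of buckets that had
    to be processed, i.e. all buckets lying in the stripe being split.
    """
    old = bisect_right(bounds[dim], new_bound)   # index of the stripe to split
    insort(bounds[dim], new_bound)
    new_buckets, touched = {}, 0
    for cell, pts in buckets.items():
        if cell[dim] < old:                      # below the split: unchanged
            new_buckets[cell] = pts
        elif cell[dim] > old:                    # above the split: index shifts by one
            new_buckets[cell[:dim] + (cell[dim] + 1,) + cell[dim + 1:]] = pts
        else:                                    # in the split stripe: redistribute
            touched += 1
            upper = cell[:dim] + (old + 1,) + cell[dim + 1:]
            new_buckets[cell] = [p for p in pts if p[dim] < new_bound]
            new_buckets[upper] = [p for p in pts if p[dim] >= new_bound]
    return new_buckets, touched

bounds = [[40, 55], [90, 225]]                   # hypothetical age / salary boundaries
buckets = {(i, j): [] for i in range(3) for j in range(3)}
for p in DATA + [(52, 200)]:                     # includes the newly inserted point
    buckets[cell_of(bounds, p)].append(p)

buckets, touched = split_stripe(bounds, buckets, dim=1, new_bound=130)
print(touched, buckets[(1, 1)], buckets[(1, 2)])
```

With three age stripes, the split touches three buckets, matching the count on the slide, even though only the central bucket was overfull.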

Insertion Insert (52,200). Split central bucket, for instance by splitting central salary stripe (One possibility)

Grid files
Advantages
- Good for multiple-key search.
- Supports partial match, range, and NN queries.
Disadvantages
- Space management overhead.
- Need partitioning ranges that split the keys evenly.
- Possibility of overflow buckets on insertion.

Partitioned hashing I
- If we hash the concatenation of several keys, the resulting hash table cannot be used for queries specifying only one dimension (key).
- A preferable option is to design the hash function so that it produces some number of bits, say k, divided among the n attributes.
- That is, the hash function h is a concatenation of n hash functions, one for each dimensional attribute: h = (h_1, …, h_n).
- The bucket for a tuple (v_1, …, v_n) is computed by concatenating the bit sequences h_1(v_1) … h_n(v_n).

Partitioned hashing II
Example: gold jewelry, with
- first bit = age mod 2
- bits 2 and 3 = salary mod 4
Works well for: partial match (i.e. only one attribute specified)
Bad for: range and nearest neighbor queries

Partitioned hashing III
Partial match query
- Specifying only the value of age: compute h_age(a), which could be, say, 1. Then locate all the relevant buckets, which are 100 to 111.
- Specifying only the value of salary: compute h_salary(s), which could be, say, 10. Then locate the relevant buckets, which are 010 and 110.
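The bucket computation and the partial-match enumeration can be sketched as below, using the gold-jewelry hash functions (age mod 2 for the first bit, salary mod 4 for bits 2 and 3). The function names and the sample values are illustrative assumptions; salary is taken to be in thousands.

```python
from itertools import product

AGE_BITS, SALARY_BITS = 1, 2      # k = 3 bits split among n = 2 attributes

def h_age(age):
    return age % 2

def h_salary(salary):
    return salary % 4

def bits(value, width):
    """Fixed-width binary string for one attribute's hash value."""
    return format(value, f"0{width}b")

def bucket(age, salary):
    """Concatenate the per-attribute bit sequences to get the bucket address."""
    return bits(h_age(age), AGE_BITS) + bits(h_salary(salary), SALARY_BITS)

def partial_match(age=None, salary=None):
    """Buckets to search when only some attributes are specified."""
    age_part = ([bits(h_age(age), AGE_BITS)] if age is not None
                else [bits(v, AGE_BITS) for v in range(2 ** AGE_BITS)])
    salary_part = ([bits(h_salary(salary), SALARY_BITS)] if salary is not None
                   else [bits(v, SALARY_BITS) for v in range(2 ** SALARY_BITS)])
    return [a + s for a, s in product(age_part, salary_part)]

print(bucket(50, 120))
print(partial_match(age=57))
print(partial_match(salary=110))
```

Fixing one attribute pins its bits and leaves the others free, so the number of buckets to search shrinks by a factor of 2 per known bit. Nearby values land in unrelated buckets, which is why range and nearest-neighbor queries get no help from this scheme.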

Grid files vs. partitioned hashing
- Many dimensions ⇒ many empty cells in a grid, while partitioned hashing copes well.
- Both support exact and partial match queries.
- Grid files are good for range and nearest neighbor queries, while partitioned hashing does not support them at all.