1
Multidimensional Data
2
Many database applications are "geographic", i.e. they involve 2-dimensional data. Others involve large numbers of dimensions.
Example: data about sales.
- A sale is described by (store, day, item, color, size, etc.). Sale = point in 5-dimensional space.
- A customer is described by (age, salary, pcode, marital status, etc.).
Typical queries
- Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?"
- Nearest neighbor: "If I am at coordinates (a,b), what is the nearest McDonalds?"
They are expressible in SQL. Do you see how?
3
SQL
Range queries: "How many customers for gold jewelry have age between 45 and 55, and salary less than 100K?" (with sal stored in thousands)
SELECT COUNT(*) FROM Customers WHERE age>=45 AND age<=55 AND sal<100;
Nearest neighbor: "If I am at coordinates (a,b), what is the nearest McDonalds?"
Suppose we have a relation Points(x,y,name).
SELECT * FROM Points p WHERE p.name='McDonalds' AND NOT EXISTS ( SELECT * FROM Points q WHERE (q.x-a)*(q.x-a)+(q.y-b)*(q.y-b) < (p.x-a)*(p.x-a)+(p.y-b)*(p.y-b) AND q.name='McDonalds' );
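The nearest-neighbor query can be tried out with a small script. This is a minimal sketch using Python's built-in sqlite3 module; the sample rows and the helper name nearest are made up for illustration and are not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Points (x REAL, y REAL, name TEXT)")
# Hypothetical sample rows, invented for this illustration.
conn.executemany(
    "INSERT INTO Points VALUES (?, ?, ?)",
    [(1, 1, "McDonalds"), (5, 5, "McDonalds"),
     (2, 2, "Wendys"), (9, 0, "McDonalds")],
)

def nearest(conn, a, b, name):
    """Return the points called `name` for which no other point with
    the same name is strictly closer to (a, b)."""
    sql = """
        SELECT p.x, p.y FROM Points p
        WHERE p.name = :name AND NOT EXISTS (
            SELECT * FROM Points q
            WHERE q.name = :name
              AND (q.x - :a)*(q.x - :a) + (q.y - :b)*(q.y - :b)
                < (p.x - :a)*(p.x - :a) + (p.y - :b)*(p.y - :b)
        )
    """
    return conn.execute(sql, {"a": a, "b": b, "name": name}).fetchall()

result = nearest(conn, 4, 4, "McDonalds")
print(result)
```

Without an index that understands distance, the engine has to compare each candidate p against the other rows, which is the inefficiency the following slides address.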
4
Big Impediment
For these types of queries, there is no clean way to eliminate lots of records that don't meet the condition of the WHERE clause.
An approach for range queries
Index on attributes independently.
- Intersect pointers in main memory to save disk I/O.
5
Attempt at using B-trees for MD-queries
Database = 1,000,000 points evenly distributed in a 1000×1000 square.
Stored in 10,000 blocks (100 records per block).
B-tree secondary indexes on x and on y.
Range query {(x,y) : 450 ≤ x ≤ 550, 450 ≤ y ≤ 550}
- 100,000 pointers (i.e. 1,000,000/10) for the x range, and the same for y.
- 10,000 pointers for the answer (found by pointer intersection).
- Retrieve 10,000 records. If they are stored randomly we need 10,000 I/O's.
Add the cost of the B-trees:
- The root of each B-tree is in main memory.
- Suppose leaves hold 200 keys on average.
- 500 disk I/O's in each B-tree to get the pointer lists (100,000/200).
- 1000 + 2 (for the intermediate B-tree level) disk I/O's.
Total: 11,002 disk I/O's, more than a sequential scan of the file = 10,000 I/O's.
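The I/O count above can be reproduced with a few lines of arithmetic. The variable names below are illustrative; the numbers are the ones given on the slide.

```python
import math

n_points = 1_000_000       # points in the 1000 x 1000 square
side = 1000                # side length of the square
recs_per_block = 100       # records per data block
keys_per_leaf = 200        # average keys per B-tree leaf
range_width = 550 - 450    # width of the query range in each dimension

# Evenly distributed points: a range covering 1/10 of an axis selects
# 1/10 of all pointers in that dimension.
ptrs_per_dim = n_points * range_width // side        # 100,000
answer_size = ptrs_per_dim * range_width // side     # 10,000

leaf_ios_per_tree = math.ceil(ptrs_per_dim / keys_per_leaf)  # 500
intermediate_ios = 2                 # one intermediate node per B-tree
index_ios = 2 * leaf_ios_per_tree + intermediate_ios         # 1,002

data_ios = answer_size               # worst case: one block per record
total_ios = index_ios + data_ios
scan_ios = n_points // recs_per_block

print(total_ios, scan_ios)
```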
6
Nearest Neighbor query using B-trees
Turn NN to (10,20) into a range query {(x,y) : 10-d ≤ x ≤ 10+d, 20-d ≤ y ≤ 20+d}
Possible problems:
1. No point in the selected range.
2. The closest point inside may not be the answer.
Solution: re-execute the range query with a slightly larger d.
7
NN-queries, example
Same relation Points and its indexes on x and y as before. Query: NN to (10,20).
Choose d = 1: range query = {(x,y) : 9 ≤ x ≤ 11, 19 ≤ y ≤ 21}
2000 points have x in [9,11], and 2000 points have y in [19,21].
For each dimension, we pay 10+1 I/O's to get the pointers from the B-tree leaves. The +1 is because points with x=9 may not start exactly at the beginning of a leaf.
Add an extra I/O for the intermediate node when finding the start of the range in each index.
Total: 24 + 1 disk I/O's to get the answer, assuming 1 of the 4 points in the intersection is the answer, which we can determine from their coordinates before fetching the data blocks holding the points.
However, if d is too small, we have to run another range query with a larger d.
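The retry loop can be sketched in memory as follows. This is only an illustration under assumptions: the range query is a linear scan standing in for the two B-tree lookups, d is doubled on each retry (the slides only say "slightly larger"), and the sample points are invented. A candidate is accepted once its distance is at most d, because every point outside the square is then farther than d from the query point.

```python
import math

def range_query(points, cx, cy, d):
    """All points in the square [cx-d, cx+d] x [cy-d, cy+d]."""
    return [(x, y) for (x, y) in points
            if cx - d <= x <= cx + d and cy - d <= y <= cy + d]

def nearest_neighbor(points, cx, cy, d=1.0):
    if not points:
        return None
    while True:
        candidates = range_query(points, cx, cy, d)
        if candidates:  # otherwise problem 1: nothing in the range
            best = min(candidates,
                       key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
            # Problem 2: a point outside the square may still be closer
            # unless the best candidate lies within distance d.
            if math.hypot(best[0] - cx, best[1] - cy) <= d:
                return best
        d *= 2  # assumed growth policy

# Hypothetical data: with d = 1 only (10.9, 20.9) is in the square,
# yet (10, 21.2) is closer to (10, 20).
pts = [(10.9, 20.9), (10, 21.2), (50, 50)]
print(nearest_neighbor(pts, 10, 20))
```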
8
Grid files (hash-like structure)
Data: (25,60) (45,60) (50,75) (50,100) (50,120) (70,110) (85,140) (30,260) (25,400) (45,350) (50,275) (60,260)
Divide the data into stripes in each dimension. Each rectangle is a bucket.
Example: database records (age,salary) for people who buy gold jewelry.
9
Grid file
10
Operations
Lookup
- Find the coordinates of the point in each dimension; this gives you a bucket to search.
Nearest neighbor
- Look up point P. Consider the points in its bucket.
- Problem: there could be points in adjacent buckets that are closer.
- Problem: there could be no points at all in the bucket: widen the search?
Range queries
- Ranges define a region of buckets.
- Buckets on the border may contain points not in the range.
- Example: 35 < age <= 45; 50 < salary <= 100.
Queries specifying only one attribute
- Must search a whole row or column of buckets.
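A minimal in-memory sketch of grid-file lookup and range search over the gold-jewelry data. The stripe boundaries (age cut at 40 and 55, salary cut at 90 and 225) are an assumption chosen for illustration, since the slides do not state them, and the range bounds are treated as inclusive for simplicity.

```python
from bisect import bisect_right
from collections import defaultdict

AGE_CUTS = [40, 55]    # assumed stripe boundaries for age
SAL_CUTS = [90, 225]   # assumed stripe boundaries for salary (in K)

DATA = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
        (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

def cell(age, sal):
    """Lookup: the stripe number in each dimension identifies the bucket."""
    return (bisect_right(AGE_CUTS, age), bisect_right(SAL_CUTS, sal))

buckets = defaultdict(list)
for point in DATA:
    buckets[cell(*point)].append(point)

def range_query(buckets, age_lo, age_hi, sal_lo, sal_hi):
    """Visit the rectangle of buckets touched by the range, then filter,
    because border buckets may hold points outside the range."""
    i_lo, j_lo = cell(age_lo, sal_lo)
    i_hi, j_hi = cell(age_hi, sal_hi)
    hits = []
    for i in range(i_lo, i_hi + 1):
        for j in range(j_lo, j_hi + 1):
            for age, sal in buckets.get((i, j), []):
                if age_lo <= age <= age_hi and sal_lo <= sal <= sal_hi:
                    hits.append((age, sal))
    return hits

print(cell(50, 120))
print(range_query(buckets, 35, 45, 50, 100))
```

With these boundaries the example range touches four buckets, but only one stored point actually satisfies it.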
11
Insertion
Use overflow buckets, or split stripes in one or more dimensions.
Example: insert (52,200).
12
Insertion
Insert (52,200).
One possibility: split the central bucket, for instance by splitting the central salary stripe.
Splitting a stripe affects every bucket in it, so here the blocks of 3 buckets have to be processed. In general, the blocks of n buckets have to be processed during a split.
13
Grid files
Advantages
- Good for multiple-key search.
- Supports partial match, range, and NN queries.
Disadvantages
- Space management overhead.
- Need partitioning ranges that split the keys evenly.
- Possibility of overflow buckets on insertion.
14
Partitioned hashing I
If we hash the concatenation of several keys, the resulting hash table cannot be used for queries that specify only one dimension (key).
Instead, build the hash function h from n hash functions, one for each dimensional attribute: h = (h_1, …, h_n).
The bucket in which to put a tuple (v_1, …, v_n) is computed by concatenating the bit sequences h_1(v_1) … h_n(v_n).
15
Partitioned hashing II
Example: gold jewelry, with
- first bit: age mod 2
- bits 2 and 3: salary mod 4
Partial match? Range? NN?
16
Partitioned hashing III
Partial match query
- Specifying only the value of age: compute h_age(a), which could be, say, 1. Then locate all the relevant buckets, which are 100 to 111.
- Specifying only the value of salary: compute h_salary(s), which could be, say, 10. Then locate the relevant buckets, which are 010 and 110.
Bad for: range and nearest-neighbor queries.
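The bucket computation and the two partial-match cases can be sketched with the hash functions from the example (age mod 2 for the first bit, salary mod 4 for bits 2 and 3). The function names are illustrative.

```python
def h_age(age):
    return age % 2        # 1 bit

def h_salary(sal):
    return sal % 4        # 2 bits

def bucket(age, sal):
    """Concatenate the bit sequences: age bit first, then salary bits."""
    return (h_age(age) << 2) | h_salary(sal)

def buckets_for_age(age):
    """Partial match on age: salary bits unknown, so try all 4 values."""
    return [(h_age(age) << 2) | s for s in range(4)]

def buckets_for_salary(sal):
    """Partial match on salary: age bit unknown, so try both values."""
    return [(a << 2) | h_salary(sal) for a in range(2)]

print(format(bucket(50, 75), "03b"))
print([format(b, "03b") for b in buckets_for_age(25)])
print([format(b, "03b") for b in buckets_for_salary(110)])
```

Nearby values such as salaries 110 and 111 land in unrelated buckets, which is why range and nearest-neighbor queries gain nothing from this structure.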
17
Grid files vs. partitioned hashing
With many dimensions, a grid file has many empty cells, while partitioned hashing is OK.
Both support exact and partial match queries.
Grid files are good for range and NN queries, while partitioned hashing is not good for them at all.
18
Multiple-key indexes
An index on one attribute provides pointers to indexes on the other. Let V be a value of the first attribute. Then the index we reach by following the pointer for V is an index into the set of points that have V in the first attribute and any value in the second attribute.
19
Example
"Who buys gold jewelry" (age and salary only). Raw data in (age; salary) pairs: (25; 60) (45; 60) (50; 75) (50; 100) (50; 120) (70; 110) (85; 140) (30; 260) (25; 400) (45; 350) (50; 275) (60; 260)
Question: For what kinds of queries will a multiple-key index (age first) significantly reduce the number of disk I/O's?
The indexes can be organized as B-trees.
20
Operations
Partial match queries
- If the first attribute is specified, then the access is quite efficient.
- If the first attribute isn't specified, then we have to search every sub-index.
Range queries
- Work quite well, provided the individual indexes themselves support range queries on their attribute (e.g. they are B-trees).
- Example: the range query 35 ≤ age ≤ 55 AND 100 ≤ sal ≤ 200.
- Also, the indexes should be "primary" ones if we want to support range queries efficiently.
NN queries
- Similar to range queries.
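The range query from the example can be sketched with an in-memory stand-in for the two index levels. Sorted Python lists searched with bisect play the role of the B-trees here, which is an assumption made for illustration.

```python
from bisect import bisect_left, bisect_right

DATA = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
        (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

def build(points):
    """First level: sorted ages. Second level: sorted salaries per age."""
    sub_index = {}
    for age, sal in points:
        sub_index.setdefault(age, []).append(sal)
    for salaries in sub_index.values():
        salaries.sort()
    return sorted(sub_index), sub_index

def range_query(ages, sub_index, age_lo, age_hi, sal_lo, sal_hi):
    hits = []
    # Range search on the first attribute selects the sub-indexes.
    for age in ages[bisect_left(ages, age_lo):bisect_right(ages, age_hi)]:
        salaries = sub_index[age]
        # Range search on the second attribute inside each sub-index.
        lo = bisect_left(salaries, sal_lo)
        hi = bisect_right(salaries, sal_hi)
        hits.extend((age, sal) for sal in salaries[lo:hi])
    return hits

ages, sub_index = build(DATA)
print(range_query(ages, sub_index, 35, 55, 100, 200))
```

A query that fixes only the salary would have to loop over every age and search each sub-index, matching the partial-match remark above.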
21
KD-Trees Levels rotate among the dimensions, partitioning the points by comparison with a value for that dimension. Leaves are blocks holding the data records.
22
Geometrically… Remember we didn’t want the stripes in grid files to continue all along the vertical or horizontal direction. Here they don’t.
23
Operations
Lookup in KD-trees
- Find the appropriate leaf by binary search. Is the record there?
Insert into KD-trees
- Look up the record to be inserted, reaching the appropriate leaf.
- If there is room, put the record in that block.
- If not, find a suitable value for the appropriate dimension and split the leaf block on that dimension.
Example
- Someone 35 years old with a salary of $500K buys gold jewelry.
- Belongs in the leaf with (25; 400) and (45; 350).
- Too full: split on age. See the figure on the next slide.
24
Someone 35 years old with a salary of $500K buys gold jewelry.
It is age's turn to be used for the split. Split at 35; it is the median.
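The insertion with a leaf split can be sketched as follows. The block capacity of 2 records, the class names, and the rule that values below the split go left are assumptions for illustration. The example leaf is treated as a subtree whose level uses age (dimension 0).

```python
BLOCK_CAPACITY = 2  # assumed number of records per leaf block

class Leaf:
    def __init__(self, points):
        self.points = list(points)

class Node:
    def __init__(self, dim, value, left, right):
        self.dim, self.value = dim, value
        self.left, self.right = left, right

def insert(tree, point, depth=0, k=2):
    """Insert point and return the (possibly new) root of this subtree."""
    if isinstance(tree, Node):
        if point[tree.dim] < tree.value:
            tree.left = insert(tree.left, point, depth + 1, k)
        else:
            tree.right = insert(tree.right, point, depth + 1, k)
        return tree
    tree.points.append(point)
    if len(tree.points) <= BLOCK_CAPACITY:
        return tree                      # there was room in the block
    dim = depth % k                      # dimension whose turn it is
    values = sorted(p[dim] for p in tree.points)
    split = values[len(values) // 2]     # median value
    left = [p for p in tree.points if p[dim] < split]
    right = [p for p in tree.points if p[dim] >= split]
    if not left:                         # all values equal: cannot split
        return tree
    return Node(dim, split, Leaf(left), Leaf(right))

# The slide's example: the leaf holds (25; 400) and (45; 350).
root = insert(Leaf([(25, 400), (45, 350)]), (35, 500))
print(root.dim, root.value, root.left.points, root.right.points)
```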
25
Queries
Partial match queries
- When we don't know the value of the attribute at a node, we must explore both of its children.
- E.g. find points with age=50.
Range queries
- Sometimes a range will allow us to move to only one child of a node.
- But if the range straddles the splitting value, then we must explore both children.
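Both query types can be sketched on a small KD-tree built by median splits. The tuple-based node layout and the leaf capacity of 2 are assumptions for illustration. A partial match is expressed as a range query whose unspecified dimension is unbounded.

```python
import math

DATA = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
        (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

def build(points, depth=0, capacity=2, k=2):
    if len(points) <= capacity:
        return ("leaf", points)
    dim = depth % k                       # levels rotate among dimensions
    pts = sorted(points, key=lambda p: p[dim])
    split = pts[len(pts) // 2][dim]
    left = [p for p in pts if p[dim] < split]
    right = [p for p in pts if p[dim] >= split]
    if not left:                          # cannot separate on this dimension
        return ("leaf", points)
    return ("node", dim, split,
            build(left, depth + 1, capacity, k),
            build(right, depth + 1, capacity, k))

def range_query(tree, lo, hi):
    """Points p with lo[i] <= p[i] <= hi[i] in every dimension i."""
    if tree[0] == "leaf":
        return [p for p in tree[1]
                if all(l <= c <= h for c, l, h in zip(p, lo, hi))]
    _, dim, split, left, right = tree
    hits = []
    if lo[dim] < split:       # range reaches below the splitting value
        hits += range_query(left, lo, hi)
    if hi[dim] >= split:      # range reaches the splitting value or above
        hits += range_query(right, lo, hi)
    return hits

tree = build(DATA)
in_range = range_query(tree, (35, 100), (55, 200))
age_50 = range_query(tree, (50, -math.inf), (50, math.inf))
print(sorted(in_range), sorted(age_50))
```

At salary levels the age=50 query has no bound to prune with, so both children are always visited there.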
26
KD-trees in secondary storage
If the internal nodes don't fit in main memory, group them into blocks.
27
Quad trees
Nodes split on all dimensions at once. For a quad tree of k dimensions, each interior node has 2^k children.
[Figure: quad tree over the points a to l, plotted on axes Age (0 to 100) and Sal (0 to 400); interior nodes are labeled (Age 50, Sal 200), (Age 25, Sal 300) and (Age 75, Sal 100).]
28
Why quad trees?
In k dimensions a node has 2^k children, e.g. k=7 gives 128 children.
If 128, or 2^7, pointers fit in a block, then k=7 is a convenient number of dimensions.
29
Quad tree insert and queries
Insert
- Find the leaf node in which the new point belongs.
- If there is room, put it there.
- If not, make the leaf an interior node and give it leaves for each quadrant. Split the points among the new leaves.
- Problem: may create lots of null pointers, especially in high dimensions.
Quad tree queries
- Single point queries: easy; just go down the tree to the proper leaf.
- Range queries: varies by position of the range.
- Example: a range like 45<age<55; 180<salary<220 requires a search of four leaves.
- Nearest neighbor: problems and strategies similar to grid files.
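A 2-dimensional sketch of insertion and range search, assuming each node splits its region at the midpoint and a leaf holds at most 2 points. Both choices, as well as the class layout and the inclusive range bounds, are assumptions for illustration.

```python
CAPACITY = 2       # assumed points per leaf
MIN_SIZE = 1e-9    # stop splitting tiny regions (guards duplicate points)

class QuadTree:
    def __init__(self, x0, y0, x1, y1):
        self.box = (x0, y0, x1, y1)
        self.points = []       # used while this node is a leaf
        self.children = None   # four subtrees once the node is split

    def _mid(self):
        x0, y0, x1, y1 = self.box
        return (x0 + x1) / 2, (y0 + y1) / 2

    def _child(self, p):
        mx, my = self._mid()
        return self.children[(p[0] >= mx) + 2 * (p[1] >= my)]

    def insert(self, p):
        if self.children is not None:
            self._child(p).insert(p)
            return
        self.points.append(p)
        x0, _, x1, _ = self.box
        if len(self.points) > CAPACITY and x1 - x0 > MIN_SIZE:
            mx, my = self._mid()
            x0, y0, x1, y1 = self.box
            # The leaf becomes an interior node with one leaf per quadrant.
            self.children = [QuadTree(x0, y0, mx, my), QuadTree(mx, y0, x1, my),
                             QuadTree(x0, my, mx, y1), QuadTree(mx, my, x1, y1)]
            old, self.points = self.points, []
            for q in old:
                self._child(q).insert(q)

    def range_query(self, lo, hi):
        x0, y0, x1, y1 = self.box
        if hi[0] < x0 or lo[0] > x1 or hi[1] < y0 or lo[1] > y1:
            return []          # region does not overlap the range
        if self.children is None:
            return [p for p in self.points
                    if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]]
        return [p for c in self.children for p in c.range_query(lo, hi)]

DATA = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
        (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
tree = QuadTree(0, 0, 100, 400)
for point in DATA:
    tree.insert(point)
print(sorted(tree.range_query((35, 100), (55, 200))))
```

A range around (50, 200) overlaps all four quadrants of the root, so the search has to descend into each of them, which is the situation the range-query example describes.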