Published by Joella Perkins. Modified over 9 years ago.
1
iMinMax
B. C. Ooi, K.-L. Tan, C. Yu, S. Bressan. Indexing the Edges -- A Simple and Yet Efficient Approach to High-Dimensional Indexing. 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 166-174, (2000).
2
Query Requirement
Window/range query: retrieve the data points that fall within a given range along each dimension.
The index is designed to support range retrieval and to facilitate joins and similarity search (if applicable).
3
Strategies
Increase fan-out by increasing the node size, as in X-trees. Does this reduce indexing to a semi-sequential scan?
Increase fan-out by using approximations in the nodes, as in A-trees. Updates are expensive.
Map high-dimensional points to single-dimensional points using space-filling curves.
– The transformation is expensive.
– The number of sub-queries generated can be very large.
Sequential scan has to search the whole data file: the affected volume is 100%.
4
Pyramid Technique
The Pyramid Technique leads to 2d sub-queries on a B+-tree.
[Figure: 2-dimensional example of the pyramids]
5
VA-file (Vector Approximation file)
Basic idea: divide the data space into 2^b cells, each represented by a bit string.
Map objects to approximate bit strings, e.g. (0.1, 0.8) → (00, 11).
Scan the whole signature file sequentially.
Weakness: the precision of the data is lost, which affects the size of the query window. E.g. (0.2, 0.8) → (00, 11), the same approximation as for (0.0, 1.0).
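The approximation step can be illustrated with a short sketch. It assumes a uniform grid on [0, 1] with the same number of bits for every dimension (2 bits in the slide's examples); the function name `va_approximation` and the `bits_per_dim` parameter are hypothetical and do not come from the slides.

```python
def va_approximation(point, bits_per_dim=2):
    """Map each coordinate in [0, 1] to a fixed-width bit string.

    Assumption: cells are equal-width intervals of [0, 1].
    """
    cells = 1 << bits_per_dim
    approx = []
    for x in point:
        # Clamp so that x == 1.0 lands in the last cell, not one past it.
        cell = min(int(x * cells), cells - 1)
        approx.append(format(cell, "0{}b".format(bits_per_dim)))
    return tuple(approx)


for p in [(0.1, 0.8), (0.2, 0.8), (0.0, 1.0)]:
    print(p, va_approximation(p))
```

All three points receive the same approximation, which is the loss of precision the slide refers to.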
6
iMinMax -- Basic Concept
"Edge": the maximum or minimum attribute of a data point, whichever lies closer to the edge of the data space than the point's other attributes.
A data point whose edge is not included in a query range is not an answer.
Consider the unit data space ([0,1], [0,1], ..., [0,1]). E.g. point A (0.1, 0.6) has edge 0.1; point B (0.6, 0.9) has edge 0.9.
[Figure: points A (0.1, 0.6) and B (0.6, 0.9) in the data space]
7
Basic Concept
Index points using one of their attributes.
[Figure: axis labels 0, 1 and 1, 2, 3, ..., d, d+1]
8
The probability of finding an attribute with a very large (or very small) value increases with dimensionality.
E.g. for uniformly distributed data on [0..1]: in 2-dimensional space, P(some x_i > 0.9) = 0.19; in 30-dimensional space, P(some x_i > 0.9) = 0.958.
However, not all queries search for large values. Use this fact to "prune" away data points that are of no interest!
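The two probabilities can be checked with a one-line formula, assuming the d attributes are independent and uniform on [0, 1]: the chance that at least one attribute exceeds a threshold t is 1 - t^d. The function name below is illustrative only.

```python
def prob_some_attribute_exceeds(threshold, d):
    """P(at least one of d independent Uniform[0, 1] attributes > threshold)."""
    # All d attributes stay at or below the threshold with probability threshold**d.
    return 1.0 - threshold ** d


for d in (2, 30):
    print(d, round(prob_some_attribute_exceeds(0.9, d), 3))
```

Rounded to three decimals, this gives 0.19 for d = 2 and 0.958 for d = 30, matching the figures on the slide.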
9
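iMax or iMin
Use the maximum (or minimum) attribute of each point to build the index.
At most d sub-queries are needed.
The algorithm is very simple: each data point is keyed by its maximum attribute (iMax), and a range query is translated into one sub-query per dimension.
* Similar arguments apply for iMin.
[Figure: iMax in 2 dimensions, axes x and y, regions x > y and y > x]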
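The formulas on this slide did not survive extraction. The following is one reading that agrees with the worked example for points A and B on the examples slide; it is a reconstruction, not the authors' notation. The symbols l_j and u_j (lower and upper query bounds in dimension j) and the numbering of dimensions from 0 are assumptions.

```latex
\begin{aligned}
\text{data point:}\quad & x = (x_0, \dots, x_{d-1}), \qquad x_i \in [0, 1] \\
\text{range query:}\quad & q = \bigl([l_0, u_0], \dots, [l_{d-1}, u_{d-1}]\bigr) \\
\text{iMax key:}\quad & j_{\max} + x_{j_{\max}}, \qquad j_{\max} = \arg\max_i x_i \\
\text{iMax sub-query for dimension } j\text{:}\quad & \bigl[\, j + \max_i l_i,\;\; j + u_j \,\bigr] \\
\text{iMin key:}\quad & j_{\min} + x_{j_{\min}}, \qquad j_{\min} = \arg\min_i x_i \\
\text{iMin sub-query for dimension } j\text{:}\quad & \bigl[\, j + l_j,\;\; j + \min_i u_i \,\bigr]
\end{aligned}
```

The iMax lower bound follows because the maximum attribute of an answer is at least every attribute of that point, and each attribute is at least its own lower bound l_i. For A = (0.2, 0.5) the maximum lies in dimension 1, so the iMax key is 1 + 0.5 = 1.5.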
10
iMinMax -- Examples
Points A (0.2, 0.5) and B (0.87, 0.25) in 2-dimensional space. Query ([0.26, 0.75], [0.13, 0.6]).

Method    Key of A   Key of B   Sub-queries
iMax      1.5        0.87       [0.26, 0.75], [1.26, 1.6]
iMin      0.2        1.25       [0.26, 0.6], [1.13, 1.6]
iMinMax   0.2        0.87       [0.26, 0.75], [1.13, 1.6]
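The values in this example can be reproduced with a small sketch. The key and sub-query rules below are inferred from the numbers on the slide (dimensions numbered from 0, iMinMax choosing the minimum when it is closer to 0 than the maximum is to 1); all function names are hypothetical.

```python
def imax_key(p):
    j = max(range(len(p)), key=lambda i: p[i])  # dimension of the largest attribute
    return j + p[j]


def imin_key(p):
    j = min(range(len(p)), key=lambda i: p[i])  # dimension of the smallest attribute
    return j + p[j]


def iminmax_key(p):
    # Index the edge: use the minimum if it is closer to 0 than the maximum is to 1.
    return imin_key(p) if min(p) < 1 - max(p) else imax_key(p)


def subqueries(query, method):
    """One key range per dimension; query is a list of (low, high) pairs."""
    lows = [lo for lo, _ in query]
    highs = [hi for _, hi in query]
    ranges = []
    for j, (lo, hi) in enumerate(query):
        if method == "iMax":
            ranges.append((j + max(lows), j + hi))
        elif method == "iMin":
            ranges.append((j + lo, j + min(highs)))
        else:  # iMinMax
            ranges.append((j + lo, j + hi))
    return ranges


A, B = (0.2, 0.5), (0.87, 0.25)
QUERY = [(0.26, 0.75), (0.13, 0.6)]

for name, key_fn in [("iMax", imax_key), ("iMin", imin_key), ("iMinMax", iminmax_key)]:
    ranges = [(round(lo, 2), round(hi, 2)) for lo, hi in subqueries(QUERY, name)]
    print(name, round(key_fn(A), 2), round(key_fn(B), 2), ranges)
```

Values are rounded before printing only to hide floating-point noise in sums such as 1 + 0.26.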
11
iMinMax's Principle
The iMinMax algorithm is still very simple: each point is keyed by either its minimum or its maximum attribute.
For d-dimensional space, at most d sub-queries are needed.
The union of the answers of all sub-queries yields the total answer.
12
Operations
Range search: a query is transformed into d sub-queries, and a normal B+-tree range search is performed for each.
Point search: an attribute is selected based on the iMinMax criteria, and an exact-match search is performed.
Update: similar to updates in B+-trees.
[Figure: B+-tree]
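These operations can be sketched with a sorted list standing in for the B+-tree. This is an illustration under stated assumptions, not the authors' implementation: the class `SortedKeyIndex`, the helper `iminmax_key`, the 0-based dimension numbering, and the simple per-dimension sub-query [j + low_j, j + high_j] are all choices made for this sketch.

```python
import bisect


def iminmax_key(point, theta=0.0):
    """Single-dimensional key: dimension number plus the chosen edge attribute."""
    x_min, x_max = min(point), max(point)
    if x_min + theta < 1 - x_max:
        return point.index(x_min) + x_min
    return point.index(x_max) + x_max


class SortedKeyIndex:
    """Sorted list of (key, point) pairs; a stand-in for a B+-tree."""

    def __init__(self, theta=0.0):
        self.theta = theta
        self.keys = []
        self.points = []

    def insert(self, point):
        key = iminmax_key(point, self.theta)
        pos = bisect.bisect_right(self.keys, key)
        self.keys.insert(pos, key)
        self.points.insert(pos, point)

    def point_search(self, point):
        key = iminmax_key(point, self.theta)
        i = bisect.bisect_left(self.keys, key)
        while i < len(self.keys) and self.keys[i] == key:
            if self.points[i] == point:
                return True
            i += 1
        return False

    def range_search(self, query):
        """query is a list of (low, high) pairs, one per dimension."""
        candidates = set()
        for j, (lo, hi) in enumerate(query):
            # One one-dimensional range search per dimension.
            start = bisect.bisect_left(self.keys, j + lo)
            stop = bisect.bisect_right(self.keys, j + hi)
            candidates.update(range(start, stop))
        # Candidates can be false drops, so check every attribute against the window.
        return [
            self.points[i]
            for i in sorted(candidates)
            if all(lo <= x <= hi for x, (lo, hi) in zip(self.points[i], query))
        ]


index = SortedKeyIndex()
for p in [(0.2, 0.5), (0.87, 0.25), (0.3, 0.5)]:
    index.insert(p)
print(index.range_search([(0.26, 0.75), (0.13, 0.6)]))
```

Candidate positions are collected in a set so that a point whose key sits on the boundary between two adjacent sub-query ranges is reported only once.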
13
Data Distributions -- Uniform Distribution
(a) 2-dim. uniform data space
(b) iMinMax keys of 30-dim. uniform data set (any 2 dimensions shown)
14
Data Distributions -- Normal Skewed Distribution
(a) 2-dim. normal skewed data space (mean = 0.6)
(b) iMinMax keys of 30-dim. normally skewed data set
15
Data Distributions -- Exponential Distribution
(a) 2-dim. exponential skewed data space
(b) iMinMax keys of 30-dim. exponentially skewed data set
16
iMinMax(θ)'s Principle
Introduce θ to tune iMinMax for better performance.
E.g. set θ = 0.2, point (0.1, 0.8), query ([0, 0.6], [0.1, 0.7]):

Method       Key   Sub-queries            Point
iMinMax      0.1   [0, 0.6], [1.1, 1.7]   checked
iMinMax(θ)   1.8   [0, 0.6], [1.1, 1.7]   not checked

The sub-queries stay the same as for iMinMax (independent of θ).
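The effect of θ in this example can be traced with a short sketch. The key rule used here (take the minimum attribute when x_min + θ < 1 - x_max, otherwise the maximum) is an assumption chosen because it reproduces the keys 0.1 and 1.8 above; the function names and the 0-based dimension numbering are likewise illustrative.

```python
def iminmax_theta_key(point, theta):
    """Key with tuning knob theta; theta = 0 gives plain iMinMax in this sketch."""
    x_min, x_max = min(point), max(point)
    if x_min + theta < 1 - x_max:
        return point.index(x_min) + x_min
    return point.index(x_max) + x_max


def is_checked(key, query):
    """True if the key falls inside any per-dimension sub-query [j + low, j + high]."""
    return any(j + lo <= key <= j + hi for j, (lo, hi) in enumerate(query))


point = (0.1, 0.8)
query = [(0.0, 0.6), (0.1, 0.7)]

for theta in (0.0, 0.2):
    key = iminmax_theta_key(point, theta)
    print(theta, round(key, 2), is_checked(key, query))
```

The point is not an answer, because 0.8 lies outside [0.1, 0.7]. With θ = 0 its key falls in the first sub-query, so it is fetched and rejected as a false drop. With θ = 0.2 the key moves to the maximum edge and no sub-query touches it.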
17
Performance Study
Default data distribution: uniform on [0..1]
Default query selectivity: 0.1%
Default data set: 500K
18
Performance Study -- iMinMax(θ) on skewed data

             Normal distribution   Exponential distribution
Data size    100K, 500K            500K
Dimension    30                    30
Query side   0.4                   0.4
θ            -0.3 ~ 0.3 (0.5)      -0.1 ~ 0.4

* The distribution of the query ranges is the same as that of the data set.
* The data sets are skewed to different degrees, so the tuning effects differ.
* The tuning "knob", θ, enables iMinMax to scatter the skewed data points and so reduce false drops.
19
Performance of iMinMax(θ)
Skewed normal distribution: the performance gain is up to 66%.
[Figure: two panels, data set size = 100K and data set size = 500K]
20
Performance of iMinMax(θ)
[Figure: skewed exponential distribution, data size = 500K]