iMinMax
B.C. Ooi, K.-L. Tan, C. Yu, S. Bressan. Indexing the Edges -- A Simple and Yet Efficient Approach to High-Dimensional Indexing. ACM SIGMOD-SIGACT-SIGART 19th Symposium on Principles of Database Systems (PODS), 166-174 (2000).

Query Requirement
Window/range query: retrieve the data points that fall within a given range along each dimension.
The index is designed to support range retrieval and to facilitate joins and similarity search (if applicable).

Strategies
- Increase fan-out by increasing the node size, as in X-trees. Does this reduce indexing to a semi-sequential scan?
- Increase fan-out by using approximations in the nodes, as in A-trees. Updates are expensive.
- Map high-dimensional points to single-dimensional points using space-filling curve methods. The transformation is expensive, and the number of sub-queries generated can be very large.
- Sequential scan must search the whole data file: the affected volume is 100%.

Pyramid Technique
The Pyramid Technique leads to 2d sub-queries on a B+-tree.
[Figure: 2-dimensional example of the pyramid partitioning]

VA-file (Vector Approximation file)
- Basic idea: divide the data space into 2^b cells, each represented by a bit string, and map objects to these approximate bit strings. E.g. (0.1, 0.8) -> (00, 11).
- A search scans the whole signature file sequentially.
- Weakness: the precision of the data is lost, which affects the size of the query window. E.g. (0.2, 0.8) -> (00, 11), the same bit string as for (0.0, 1.0).
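The mapping on this slide can be sketched as a simple quantization. This is a minimal illustration, not the VA-file implementation: the function name `va_approximation` is hypothetical, and it assumes 2 bits per dimension on the unit interval, which is what the slide's (00, 11) examples suggest.

```python
def va_approximation(point, bits=2):
    """Quantize each coordinate in [0, 1] to a bit string of length `bits`.

    Assumption: equal-width cells and the same number of bits in every
    dimension (the slide's examples use 2 bits per dimension).
    """
    cells = 1 << bits  # 2**bits cells along each dimension
    approx = []
    for x in point:
        # Clamp so that x == 1.0 falls into the last cell instead of overflowing.
        cell = min(int(x * cells), cells - 1)
        approx.append(format(cell, f"0{bits}b"))
    return tuple(approx)


if __name__ == "__main__":
    for p in [(0.1, 0.8), (0.2, 0.8), (0.0, 1.0)]:
        print(p, "->", va_approximation(p))
```

All three points receive the same approximation, which is the loss of precision the slide names as the weakness.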

iMinMax -- Basic Concept
- 'Edge': the maximum or minimum attribute of a data point, whichever is closer to the edge of the data space.
- A data point whose edge is not included in the query range is not an answer.
- Consider the unit data space ([0,1], [0,1], ..., [0,1]). E.g. point A (0.1, 0.6) has edge 0.1; point B (0.6, 0.9) has edge 0.9.
[Figure: points A (0.1, 0.6) and B (0.6, 0.9) in the unit data space]
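As a small sketch of the edge definition, assuming the unit data space: the minimum attribute lies at distance x_min from the lower edge and the maximum attribute at distance 1 - x_max from the upper edge, and the edge is whichever is closer. The function name `edge` and the tie-breaking rule (ties go to the maximum) are assumptions made for this illustration.

```python
def edge(point):
    """Return the attribute value of `point` that lies closest to the
    boundary of the unit data space [0, 1]^d."""
    lo, hi = min(point), max(point)
    # lo is `lo` away from 0; hi is `1 - hi` away from 1.
    # Assumption: on a tie, the maximum attribute is chosen.
    return lo if lo < 1.0 - hi else hi


if __name__ == "__main__":
    print(edge((0.1, 0.6)))  # point A from the slide
    print(edge((0.6, 0.9)))  # point B from the slide
```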

Basic Concept
Index points using one of their attributes.
[Figure: axis labels 0, 1 and 1, 2, 3, ..., d, d+1]

The probability of finding an attribute with a very large (or very small) value increases with dimensionality. For uniformly distributed data on [0..1], the probability that at least one attribute x_i exceeds 0.9 is 1 - 0.9^d:
- in 2-dim. space, P = 1 - 0.9^2 = 0.19;
- in 30-dim. space, P = 1 - 0.9^30 = 0.958 (approximately).
However, not all queries search for large values. Use this fact to "prune" away data points that are of no interest!
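The two probabilities can be checked with a few lines of arithmetic. The sketch assumes independent attributes, each uniform on [0, 1], so that the chance of no attribute exceeding a threshold t is t^d; the function name is hypothetical.

```python
def prob_some_attribute_above(threshold, d):
    """P(at least one of d independent uniform[0, 1] attributes > threshold)."""
    return 1.0 - threshold ** d


if __name__ == "__main__":
    for d in (2, 30):
        print(d, round(prob_some_attribute_above(0.9, d), 3))
```

By symmetry, the same values hold for the probability of finding at least one attribute below 0.1.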

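iMax or iMin
- Use the max. (or min.) attribute to build the index.
- At most d sub-queries are needed.
- The algorithm is very simple.
- Data point: x = (x_0, ..., x_{d-1}). Range query: q = ([l_0, h_0], ..., [l_{d-1}, h_{d-1}]). Dimensions are numbered from 0, as in the examples.
- E.g. iMax key: j_max + x_max, where x_max is the largest attribute and j_max is its dimension. The sub-query for dimension j is [j + max_i l_i, j + h_j].
* Similar arguments apply for iMin.
[Figure: 2-dimensional data space with axes x and y, split into the regions x > y and y > x (iMax)]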

iMinMax -- Examples
Points A (0.2, 0.5) and B (0.87, 0.25) in 2-dimensional space. Query: ([0.26, 0.75], [0.13, 0.6]).

           A      B      sub-queries
iMax       1.5    0.87   [0.26, 0.75], [1.26, 1.6]
iMin       0.2    1.25   [0.26, 0.6], [1.13, 1.6]
iMinMax    0.2    0.87   [0.26, 0.75], [1.13, 1.6]
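The figures in this example can be reproduced with the following sketch. The formulas are inferred from the example values rather than quoted from the slides, so treat them as assumptions: dimensions are numbered from 0, a key is "dimension number + attribute value", and the iMinMax key uses the minimum attribute when it is closer to 0 than the maximum is to 1. All function names are hypothetical.

```python
def _argmin(p):
    return min(range(len(p)), key=p.__getitem__)


def _argmax(p):
    return max(range(len(p)), key=p.__getitem__)


def imax_key(p):
    j = _argmax(p)
    return j + p[j]


def imin_key(p):
    j = _argmin(p)
    return j + p[j]


def iminmax_key(p):
    j_min, j_max = _argmin(p), _argmax(p)
    # Pick whichever of the min/max attributes is closer to the space edge.
    if p[j_min] < 1.0 - p[j_max]:
        return j_min + p[j_min]
    return j_max + p[j_max]


def imax_subqueries(q):
    # A point's maximum attribute is at least every lower bound of the query.
    lo = max(l for l, _ in q)
    return [(j + lo, j + h) for j, (_, h) in enumerate(q)]


def imin_subqueries(q):
    # A point's minimum attribute is at most every upper bound of the query.
    hi = min(h for _, h in q)
    return [(j + l, j + hi) for j, (l, _) in enumerate(q)]


def iminmax_subqueries(q):
    return [(j + l, j + h) for j, (l, h) in enumerate(q)]


if __name__ == "__main__":
    A, B = (0.2, 0.5), (0.87, 0.25)
    query = [(0.26, 0.75), (0.13, 0.6)]
    print("iMax   ", imax_key(A), imax_key(B), imax_subqueries(query))
    print("iMin   ", imin_key(A), imin_key(B), imin_subqueries(query))
    print("iMinMax", iminmax_key(A), iminmax_key(B), iminmax_subqueries(query))
```

In all three schemes neither A nor B has a key inside any sub-query, so neither point is fetched, which agrees with the fact that neither lies in the query window.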

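iMinMax's Principle
- The algorithm of iMinMax is still very simple.
- iMinMax key: j_min + x_min if x_min < 1 - x_max, otherwise j_max + x_max, where x_min and x_max are the smallest and largest attributes of the point and j_min and j_max are their dimensions (numbered from 0, as in the examples).
- Sub-query for dimension j: [j + l_j, j + h_j], where [l_j, h_j] is the query range along dimension j.
- For a d-dimensional space, at most d sub-queries are needed.
- The union of the answers of all sub-queries yields the total answer.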

Operations
- Range search: a query is transformed into d sub-queries, and a normal B+-tree range search is performed for each.
- Point search: an attribute is selected based on the iMinMax criteria, and an exact-match search is performed.
- Update: similar to that of B+-trees.
[Figure: B+-tree]
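The operations above can be sketched with a sorted list standing in for the B+-tree. This is a toy illustration under stated assumptions, not the authors' implementation: the class name `IMinMaxIndex` is hypothetical, the key rule is the one suggested by the iMinMax examples (dimensions numbered from 0), and candidates returned by a sub-query are filtered against the full query to discard false drops.

```python
import bisect


class IMinMaxIndex:
    """Toy iMinMax index over points in [0, 1]^d (sorted list, not a B+-tree)."""

    def __init__(self):
        self._entries = []  # kept sorted as (key, point) pairs

    @staticmethod
    def key(point):
        j_min = min(range(len(point)), key=point.__getitem__)
        j_max = max(range(len(point)), key=point.__getitem__)
        if point[j_min] < 1.0 - point[j_max]:
            return j_min + point[j_min]
        return j_max + point[j_max]

    def insert(self, point):
        point = tuple(point)
        bisect.insort(self._entries, (self.key(point), point))

    def range_search(self, query):
        """query is a list of (low, high) pairs, one per dimension."""
        hits = set()  # positions in the list; avoids duplicates at key j + 1.0
        for j, (lo, hi) in enumerate(query):
            # One single-dimensional range scan per sub-query [j + lo, j + hi].
            i = bisect.bisect_left(self._entries, (j + lo,))
            while i < len(self._entries) and self._entries[i][0] <= j + hi:
                point = self._entries[i][1]
                # Filter false drops: check the point against the full query.
                if all(l <= x <= h for x, (l, h) in zip(point, query)):
                    hits.add(i)
                i += 1
        return [self._entries[i][1] for i in sorted(hits)]

    def point_search(self, point):
        point = tuple(point)
        i = bisect.bisect_left(self._entries, (self.key(point), point))
        return i < len(self._entries) and self._entries[i][1] == point


if __name__ == "__main__":
    index = IMinMaxIndex()
    for p in [(0.2, 0.5), (0.87, 0.25), (0.3, 0.5)]:
        index.insert(p)
    print(index.range_search([(0.26, 0.75), (0.13, 0.6)]))
    print(index.point_search((0.87, 0.25)))
```

A real B+-tree would replace the list so that inserts and range scans touch only a few pages; the query transformation itself stays the same.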

Data Distributions -- Uniform Distribution
[Figure: (a) 2-dim. uniform data space; (b) iMinMax keys of a 30-dim. uniform data set (showing any 2 dimensions)]

Data Distributions -- Normal Skewed Distribution
[Figure: (a) 2-dim. normal skewed data space (mean = 0.6); (b) iMinMax keys of a 30-dim. normally skewed data set]

Data Distributions -- Exponential Distribution
[Figure: (a) 2-dim. exponential skewed data space; (b) iMinMax keys of a 30-dim. exponentially skewed data set]

iMinMax(θ)'s Principle
- Introduce θ to tune iMinMax for better performance.
- iMinMax(θ) key: j_min + x_min if x_min + θ < 1 - x_max, otherwise j_max + x_max (dimensions numbered from 0, as in the example).
- E.g. set θ = 0.2, point (0.1, 0.8), query ([0, 0.6], [0.1, 0.7]):

              key    sub-queries             point
iMinMax       0.1    [0, 0.6], [1.1, 1.7]    checked
iMinMax(θ)    1.8    [0, 0.6], [1.1, 1.7]    not checked

- The sub-query for dimension j is still [j + l_j, j + h_j] (independent of θ).
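The effect of θ in this example can be traced with a short sketch. The key rule below is inferred from the example values (θ shifts the point at which the key switches from the minimum to the maximum attribute), so it is an assumption rather than a quotation; the function names are hypothetical.

```python
def iminmax_theta_key(point, theta=0.0):
    """Key under iMinMax(theta); theta = 0 gives plain iMinMax."""
    j_min = min(range(len(point)), key=point.__getitem__)
    j_max = max(range(len(point)), key=point.__getitem__)
    if point[j_min] + theta < 1.0 - point[j_max]:
        return j_min + point[j_min]
    return j_max + point[j_max]


def subqueries(query):
    # Assumed to be independent of theta, as the slide states.
    return [(j + lo, j + hi) for j, (lo, hi) in enumerate(query)]


def is_checked(key, query):
    """True if some sub-query would fetch a point with this key."""
    return any(lo <= key <= hi for lo, hi in subqueries(query))


if __name__ == "__main__":
    point = (0.1, 0.8)
    query = [(0.0, 0.6), (0.1, 0.7)]
    for theta in (0.0, 0.2):
        key = iminmax_theta_key(point, theta)
        print(theta, key, "checked" if is_checked(key, query) else "not checked")
```

The point is not an answer (0.8 lies outside [0.1, 0.7]), so with θ = 0 it is a false drop, while θ = 0.2 moves its key out of every sub-query.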

Performance Study
- Default data distribution: uniform on [0..1]
- Default query selectivity: 0.1%
- Default data set size: 500K

Performance Study -- iMinMax(θ) on skewed data

              Normal distribution    Exponential distribution
Data size     100K, 500K             500K
Dimension     30                     30
Query side    0.4                    0.4
θ             -0.3 ~ 0.3 (0.5)       -0.1 ~ 0.4

* The distribution of the query range is the same as that of the data set.
* The data sets are skewed to different degrees, so the tuning effects differ.
* The tuning 'knob', θ, enables iMinMax to scatter the skewed data points to reduce false drops.

Performance of iMinMax(θ)
Skewed normal distribution: the performance gain is up to 66%.
[Figure: two panels, data set size = 100K and data set size = 500K]

Performance of iMinMax(θ)
[Figure: skewed exponential distribution, data size = 500K]
