July, 2001 High-dimensional indexing techniques Kesheng John Wu Ekow Otoo Arie Shoshani.

1 July, 2001 High-dimensional indexing techniques Kesheng John Wu Ekow Otoo Arie Shoshani

2 July, 2001 The big picture
[Diagram; labels recovered from the slide: grid, storage, MPI-IO, file, Request Interpreter, dataset, Data mining, Distributed, Large]

3 July, 2001 The big picture
[Diagram; labels recovered from the slide: Request interpreter; Logical request; Qualified objects; Request planning/execution; Execution services; grid; LBNL; PPDG; MPI-IO, …; Sub-task schedule]

4 July, 2001 Problem statement
Main objective: map a logical request to qualified objects
- A logical request: 20001015 <= eventTime & 200 < energy < 300 …
- Objects: set of object IDs; set of files containing the objects; offsets within the files, …
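
As a point of reference, the mapping from a logical request to qualified object IDs can be sketched as a brute-force scan. Only the attribute names eventTime and energy and the predicate come from the slide; the sample records and the function name qualified_objects are made up for illustration.

```python
# Hypothetical sample objects; only the attribute names come from the slide.
objects = [
    {"id": 0, "eventTime": 20001014, "energy": 250.0},
    {"id": 1, "eventTime": 20001015, "energy": 250.0},
    {"id": 2, "eventTime": 20001101, "energy": 310.0},
    {"id": 3, "eventTime": 20001120, "energy": 201.5},
]

def qualified_objects(objs):
    """Return IDs of objects with 20001015 <= eventTime and 200 < energy < 300."""
    return [
        o["id"]
        for o in objs
        if 20001015 <= o["eventTime"] and 200 < o["energy"] < 300
    ]

print(qualified_objects(objects))
```

An index is meant to return the same set of IDs without touching every object.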

5 July, 2001 Requirements & Status
General requirements
- Users request data in terms of their scientific domain, not file names or offsets in files
- Each object may be described by hundreds of attributes
- Each request is expressed as range predicates on a handful of attributes (partial range query)
Status
- Initially motivated by a HENP experiment: STAR
- Software originally developed under GC and currently in use at BNL

6 July, 2001 Large high-dimensional datasets
Number of attributes / columns: 200 – 500
Number of objects / events: 10^8 – 10^9
File containing one attribute: 400MB – 4GB
Total size over all attributes: 80GB – 2TB
[Table sketch: columns A1, A2, A3, A4, …; one row per object ID, from 0 up to 10^8 – 10^9]
Goal: develop an index, so that:
- Read as little as possible from disk
- Minimize computation in memory
Curse of dimensionality
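
The sizes on this slide are consistent with each other if every attribute value takes 4 bytes. That value width is an assumption, not something the slide states; the short check below only reproduces the arithmetic.

```python
BYTES_PER_VALUE = 4  # assumption: one 4-byte value per object per attribute

def attribute_file_bytes(num_objects):
    """Size of a file holding one attribute for all objects."""
    return num_objects * BYTES_PER_VALUE

def total_bytes(num_objects, num_attributes):
    """Size of all attribute files together."""
    return attribute_file_bytes(num_objects) * num_attributes

small = total_bytes(10**8, 200)  # lower end of both ranges
large = total_bytes(10**9, 500)  # upper end of both ranges
print(attribute_file_bytes(10**8), attribute_file_bytes(10**9), small, large)
```

With decimal units this gives 400MB and 4GB per attribute file, and 80GB and 2TB in total.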

7 July, 2001 Well known indexing methods
B-tree based indices
- One or a small number of attributes
- Index size may be up to 3 times the data size
R-tree based indices
- Small number of attributes, say, < 10
UB-tree
- Uses space-filling curves to map high-dimensional data to one dimension
- One range query is mapped into many queries on the B-tree based index
Even sequential scan
- Better than B-tree and R-tree if dimension > 10
- Simply read all data and compare => takes too long
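
The UB-tree point can be illustrated with a small sketch, assuming a Z-order (Morton) curve as the space-filling curve. The function names z_value and z_intervals are hypothetical, and the 4 x 4 grid is chosen only to keep the numbers small.

```python
def z_value(coords, bits):
    """Interleave the low `bits` bits of each coordinate into one integer."""
    z = 0
    ndim = len(coords)
    for bit in range(bits):
        for dim, c in enumerate(coords):
            z |= ((c >> bit) & 1) << (bit * ndim + dim)
    return z

def z_intervals(x_range, y_range, bits):
    """Contiguous runs of Z-values covered by an inclusive 2-D box query."""
    zs = sorted(
        z_value((x, y), bits)
        for x in range(x_range[0], x_range[1] + 1)
        for y in range(y_range[0], y_range[1] + 1)
    )
    runs = []
    for z in zs:
        if runs and z == runs[-1][1] + 1:
            runs[-1][1] = z      # extend the current run
        else:
            runs.append([z, z])  # start a new run
    return [tuple(r) for r in runs]

# A 2 x 2 box on a 4 x 4 grid already splits into three 1-D ranges.
print(z_intervals((1, 2), (0, 1), bits=2))
```

Each run is a separate range lookup on the one-dimensional index, and the number of runs tends to grow with the size of the box and the number of dimensions.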

8 July, 2001 Another class of indexes: Bitmap index
Example queries on the attribute, say, A
- One-sided range query: A < 2: b0 OR b1
- Two-sided range query: 2 < A < 5: b3 OR b4
Basic steps of building a bitmap index
- Binning
- Encoding
- Compressing
Data value   b0 (=0)  b1 (=1)  b2 (=2)  b3 (=3)  b4 (=4)  b5 (=5)
0            1        0        0        0        0        0
1            0        1        0        0        0        0
5            0        0        0        0        0        1
3            0        0        0        1        0        0
1            0        1        0        0        0        0
2            0        0        1        0        0        0
0            1        0        0        0        0        0
4            0        0        0        0        1        0
1            0        1        0        0        0        0
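
The example on this slide can be reproduced with a few lines of code. This is a minimal sketch that uses plain Python lists as uncompressed bitmaps; the data values and the two queries are from the slide, while the helper names are made up for illustration.

```python
values = [0, 1, 5, 3, 1, 2, 0, 4, 1]  # data values from the slide

def build_equality_bitmaps(data, num_values):
    """bitmaps[v][i] is 1 exactly when data[i] == v."""
    return [[1 if x == v else 0 for x in data] for v in range(num_values)]

def bit_or(a, b):
    """Bitwise OR of two bitmaps of equal length."""
    return [x | y for x, y in zip(a, b)]

def rows(bitmap):
    """Row numbers whose bit is set."""
    return [i for i, bit in enumerate(bitmap) if bit]

b = build_equality_bitmaps(values, 6)
one_sided = bit_or(b[0], b[1])  # A < 2
two_sided = bit_or(b[3], b[4])  # 2 < A < 5

print(rows(one_sided), rows(two_sided))
```

The query never looks at the data values themselves; it only combines the precomputed bitmaps.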

9 July, 2001 How many bins?
[Figure; labels recovered from the slide: Range(x), Range(y), Edge bin]
More bins => fewer objects in edge bins
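
The effect of the bin count can be sketched in one dimension. The assumption here is that a bin lying only partly inside the query range is an edge bin, and that the objects in it must be checked against the raw data. The uniform data set, the query bounds, and the function name are invented for the illustration.

```python
def edge_bin_objects(values, lo, hi, num_bins, domain=1000):
    """Count objects in bins that only partly overlap the query lo <= x < hi.

    Bins are equal-width over [0, domain); num_bins must divide domain.
    """
    width = domain // num_bins
    count = 0
    for x in values:
        start = (x // width) * width  # left boundary of the bin holding x
        end = start + width
        overlaps = start < hi and end > lo
        contained = start >= lo and end <= hi
        if overlaps and not contained:
            count += 1
    return count

data = list(range(1000))  # made-up, uniformly spread values
for bins in (4, 20, 100):
    print(bins, edge_bin_objects(data, 133, 467, bins))
```

In this sketch, going from 4 to 100 bins shrinks the set of objects that need checking, at the cost of a larger index.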

10 July, 2001 How to encode
[Figure: equality encoding, range encoding, and interval encoding, illustrated for 6 bins numbered 0 to 5]
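
The slide does not spell out the three encodings, so the sketch below assumes the definitions commonly given in the bitmap index literature: equality bitmap i marks value == i, range bitmap i marks value <= i, and interval bitmap j marks j <= value <= j + m with m = C/2 - 1 for C bins. The data values are reused from the bitmap index example; the helper names are hypothetical.

```python
values = [0, 1, 5, 3, 1, 2, 0, 4, 1]  # data values from the bitmap index slide
C = 6                                  # number of bins, numbered 0..5

def bitmap(lo, hi):
    """Bitmap with bit i set when lo <= values[i] <= hi."""
    return [1 if lo <= v <= hi else 0 for v in values]

def bit_or(a, b):
    return [x | y for x, y in zip(a, b)]

def and_not(a, b):
    """Bits set in a but not in b."""
    return [x & (1 - y) for x, y in zip(a, b)]

equality = [bitmap(i, i) for i in range(C)]                 # 6 bitmaps
range_enc = [bitmap(0, i) for i in range(C - 1)]            # 5 bitmaps
m = C // 2 - 1                                              # assumed interval width
interval = [bitmap(j, j + m) for j in range((C + 1) // 2)]  # 3 bitmaps

# The two-sided query 2 < A < 5 under each encoding.
eq_answer = bit_or(equality[3], equality[4])
range_answer = and_not(range_enc[4], range_enc[2])
interval_answer = and_not(interval[2], interval[0])

print(eq_answer, range_answer, interval_answer)
```

All three encodings return the same rows; they differ in how many bitmaps are stored and how many are read per query.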

11 July, 2001 Advantages of bitmap indices
Fast operations
- The most common operations are the bitwise logical operations
- They are well supported by hardware
Easy to compress, potentially small index size
Each individual bitmap is small, and frequently used ones can be cached in memory
Efficient for read-mostly data: data produced from scientific experiments can be appended in large groups
Available in most major commercial DBMS

12 July, 2001 Why our own bitmap index
Early tests showed that we can do an order of magnitude better than ORACLE (using equality encoding)
Vertical partition: allows one to read only the data of the attributes involved in a query
New compression method
- Best known: Byte-aligned Bitmap Code (BBC)
- Developed 2 word-aligned schemes: WAH, WBC
Different encoding schemes under compression
- Equality encoding: used in ORACLE and others
- Range encoding: one-sided range queries
- Interval encoding: two-sided range queries
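
The idea behind a word-aligned scheme can be sketched as follows. This is a simplified illustration based on public descriptions of WAH with 32-bit words, not the authors' implementation: a literal word has its top bit 0 and carries 31 bitmap bits, and a fill word has its top bit 1, then the fill bit, then a 30-bit count of 31-bit groups. Zero-padding the tail is a simplification made here, and the function names are hypothetical.

```python
GROUP = 31  # payload bits per 32-bit word

def wah_encode(bits):
    """Encode a list of 0/1 values into 32-bit WAH-style words."""
    padded = bits + [0] * (-len(bits) % GROUP)  # simplification: pad the tail
    words = []
    for i in range(0, len(padded), GROUP):
        group = padded[i:i + GROUP]
        ones = sum(group)
        if ones == 0 or ones == GROUP:
            fill_bit = 1 if ones else 0
            prev = words[-1] if words else 0
            same_fill = prev >> 31 == 1 and (prev >> 30) & 1 == fill_bit
            if same_fill and (prev & 0x3FFFFFFF) < 0x3FFFFFFF:
                words[-1] = prev + 1  # one more group in the current fill
            else:
                words.append((1 << 31) | (fill_bit << 30) | 1)
        else:
            literal = 0
            for b in group:
                literal = (literal << 1) | b
            words.append(literal)  # top bit stays 0 for a literal word
    return words

def wah_decode(words, length):
    """Expand WAH-style words back into the first `length` bits."""
    bits = []
    for w in words:
        if w >> 31:
            bits.extend([(w >> 30) & 1] * (GROUP * (w & 0x3FFFFFFF)))
        else:
            bits.extend((w >> k) & 1 for k in range(GROUP - 1, -1, -1))
    return bits[:length]

# One mixed group, five all-zero groups, two all-one groups, and a short tail.
bits = [1] + [0] * 30 + [0] * (31 * 5) + [1] * (31 * 2) + [0, 1]
words = wah_encode(bits)
print([hex(w) for w in words])
```

The 250-bit example fits in 4 words here, compared with 8 words uncompressed. Because every word holds whole 31-bit groups, logical operations can work word by word without extracting individual bytes, which is the usual explanation for the speed advantage over byte-aligned codes.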

13 July, 2001 Information about the test machines
Hardware and system
- Sun Enterprise 450 (UltraSPARC II 400MHz)
- 4GB RAM
- VERITAS Volume Manager (striped disk)
Real application data from STAR
- Above 2 million objects
- Picked 12 attributes with varying distributions
Measures:
- Logical operation time without IO
- Logical operation time with IO
- Query processing time

14 July, 2001 Logical operation time (no IO)

15 July, 2001 Logical operation time (including IO)

16 July, 2001 New compression schemes
Overall, use about 50% more space than BBC
On average, 12 times faster than BBC
Faster than the uncompressed scheme in more cases:
- New schemes are faster than the uncompressed scheme when the compression ratio is less than 0.3
- BBC is faster than the uncompressed scheme when the compression ratio is less than 0.03

17 July, 2001 Sizes of bitmap indices
Conclusion:
- Equality encoding is the most space efficient
- Compression gain is at least a factor of 2.5

18 July, 2001 Average query processing time
Conclusion:
- Interval and range encoding are the best
- For these cases, there is practically no penalty to compression

19 July, 2001 Interval encoding is better overall Sequential scan time: 0.557 sec

20 July, 2001 Summary
Better compression scheme
- 50% more space, but 10-12 times faster
Among the different encoding schemes
- Interval encoding is better than equality encoding and range encoding
Selecting the number of bins determines the bitmap index size and operation efficiency. For example:
- 10% of data size => 3 x speed of sequential scan
- 20% of data size => 6 x speed of sequential scan
Equality encoding is currently used in the STAR experiment. The next version will include interval encoding.

21 July, 2001 Future work
- Support NULL values and categorical values
- On-line update: add new data and update the index without interrupting request processing
- Recovery mechanism for robustness
- Potential new applications: climate, astrophysics, biology
- Study different non-uniform binning strategies
- Integrate with a conventional database system: to better handle metadata, to provide a more versatile front-end

