Published by Aleesha Carr. Modified over 9 years ago.
Page 1 MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services
Shoji Nishimura (NEC Service Platforms Labs.), Sudipto Das, Divyakant Agrawal, Amr El Abbadi (University of California, Santa Barbara)
* Work done as a visiting researcher at UCSB
Page 2 Overview
▐A Motivating Story
▐Existing Technologies
▐Our Proposal
▐Evaluation
▐Conclusion
Page 3 Motivating Scenario: Mobile Coupon Distribution
[Figure labels: Mobile Coupon Distributor; Coupon; Current Location (one per user); Distribution Policy: Area, # of coupons]
Page 4 Motivating Scenario: Mobile Coupon Distribution
[Figure labels: Coupon; Current Location (many users); Distribution Policy: Area, # of coupons]
▐Large amounts of Data
▐High Throughput
▐System Scalability
▐Multi-Dimensional Query
▐Nearest Neighbors Query
▐Efficient Complex Queries
▐125,000,000 subscribers in Japan
Page 5 Existing Technologies
[Chart axes: Multi-dimensional Queries vs. Scalability]
▐Relational DBs
▐Spatial DBs
▐Commercial products, but expensive
▐Open source products
▐Key-Value Stores
▐What We Want, at a reasonable price
Page 6 Ordered Key-Value Stores
[Figure: an Index (key00, key11, keynn) points to Buckets of key-value pairs (key00 to key0X, key11 to key1Y, keynn), sorted by key.]
▐Sorted by key
▐Good at 1-D Range Query
▐But our target is multi-dimensional: Longitude, Latitude, Time…
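As a minimal sketch of why a store sorted by key is good at 1-D range queries, the toy class below keeps keys in a sorted list and answers a range scan with two binary searches. `OrderedKV`, `put`, and `scan` are hypothetical names for illustration; this is not the HBase or BigTable API.

```python
import bisect

class OrderedKV:
    """Toy in-memory ordered key-value store (illustrative only)."""

    def __init__(self):
        self._keys = []      # keys kept in sorted order
        self._values = {}    # key -> value

    def put(self, key, value):
        # Insert the key at its sorted position the first time it is seen.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, stop):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

store = OrderedKV()
for key in ["key11", "key00", "key12", "key01"]:
    store.put(key, "value" + key[3:])

# A 1-D range query touches only the contiguous run of matching keys.
result = list(store.scan("key01", "key12"))
```

The scan is efficient only because the query is a single interval on the one sorted dimension; a rectangle over longitude and latitude does not map to one such interval.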
Page 7 Naïve Solution: Linearization
▐Projects n-D space to 1-D space
▐Apply a Z-ordering curve…
▐Simple, but problematic…
Z-order values on a 4 × 4 grid:
 5  7 13 15
 4  6 12 14
 1  3  9 11
 0  2  8 10
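The Z-order values on this slide can be reproduced by interleaving the bits of the two coordinates. The sketch below assumes 2-bit coordinates, with the x bit placed before the y bit at each level and y increasing upward, which is the convention that matches the grid; `z_encode` is a hypothetical helper name.

```python
def z_encode(x, y, bits=2):
    """Interleave the bits of x and y (x bit first, most significant first)."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

# Rows are listed top to bottom, so y runs from 3 down to 0.
grid = [[z_encode(x, y) for x in range(4)] for y in reversed(range(4))]

for row in grid:
    print(" ".join(f"{z:2d}" for z in row))
```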
Page 8 Problem: False positive scans
▐MD-query on linearized space
  Translate the MD-query into a linearized range query. Ex. query from 2 to 9.
  Scan the queried linearized range.
  Filter out points outside the queried area. Ex. blue-hatched area (4 to 7).
  Filtering requires the boundary information of the original space.
 5  7 13 15
 4  6 12 14
 1  3  9 11
 0  2  8 10
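The false positives in this example can be checked with a short sketch. It assumes the same 2-bit interleaving as the grid (x bit before y bit) and takes the queried area to be the rectangle whose corners are cells 2 and 9; `z_decode` and `inside` are hypothetical helper names.

```python
def z_decode(z, bits=2):
    """Split an interleaved key back into (x, y); x bits sit in the odd positions."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y

# Query rectangle: lower-left corner is cell 2, upper-right corner is cell 9.
q_lo, q_hi = z_decode(2), z_decode(9)

def inside(z):
    x, y = z_decode(z)
    return q_lo[0] <= x <= q_hi[0] and q_lo[1] <= y <= q_hi[1]

scanned = list(range(2, 10))                       # linearized range 2..9
hits = [z for z in scanned if inside(z)]
false_positives = [z for z in scanned if not inside(z)]
```

Half of the scanned cells (4 to 7) lie outside the rectangle, and deciding that requires decoding every key back to the original coordinates.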
Page 9 Our Approach: MD-HBase
▐Build a Multi-dimensional Index Layer on top of an Ordered Key-Value Store
[Figure: MD-HBase adds a Multi-Dimensional Index above the Single Dimensional Index of an Ordered Key-Value Store (ex. BigTable, HBase, …).]
Page 10 Introduce Multi-dimensional Index
▐Multi-dimensional Index (ex. the K-d tree, the Quad tree)
  Divide a space into subspaces containing almost the same number of points.
  Organize the subspaces as a tree.
  Efficient subspace pruning → avoids false positive scans.
[Figure: a space divided into subspaces and organized as a tree.]
Page 11 Space Partition by the K-d tree
Binary Z-ordering space (keys formed by bitwise interleaving):
11  0101 0111 1101 1111
10  0100 0110 1100 1110
01  0001 0011 1001 1011
00  0000 0010 1000 1010
      00   01   10   11
[Figure: the same space partitioned by the K-d tree.]
▐How do we represent these subspaces?
Page 12 Key Idea: The longest common prefix naming scheme
▐Subspaces are represented as the longest common prefix of their keys (ex. 000*, 1***).
▐Remarkable property: the prefix preserves the boundary information of the original space.
  Ex. 1***
  Left-bottom corner: * → 0 gives 1000 = (10, 00)
  Right-top corner: * → 1 gives 1111 = (11, 11)
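The boundary reconstruction on this slide can be sketched in a few lines. The code assumes 4-bit keys written as strings, with x bits in the even string positions and y bits in the odd ones (the interleaving that matches the slide's example); `split_key` and `subspace_bounds` are hypothetical helper names.

```python
def split_key(key):
    """De-interleave a binary key string into (x_bits, y_bits)."""
    return key[0::2], key[1::2]

def subspace_bounds(prefix):
    """Return (left-bottom corner, right-top corner) of a prefix-named subspace."""
    low = prefix.replace("*", "0")    # smallest key in the subspace
    high = prefix.replace("*", "1")   # largest key in the subspace
    return split_key(low), split_key(high)

corners = subspace_bounds("1***")
print(corners)
```

For `1***` this gives the corners (10, 00) and (11, 11), as on the slide; no lookup into the stored points is needed.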
Page 13 Build an index with the longest common prefix of keys
▐Index: 000*, 001*, 01**, 1***
▐Buckets: one bucket is allocated per subspace.
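One way to picture the index is as a sorted list of subspace prefixes in which a key's bucket is found by binary search on each prefix's smallest key. This is only a sketch under that assumption, not the layout MD-HBase uses internally; `find_bucket` and the sample keys are hypothetical.

```python
import bisect

# Index entries from the slide; each names one subspace and owns one bucket.
prefixes = ["000*", "001*", "01**", "1***"]
# Smallest key covered by each prefix; sorted because the subspaces are disjoint.
starts = [p.replace("*", "0") for p in prefixes]

def find_bucket(key):
    """Return the prefix of the subspace that contains the given binary key."""
    i = bisect.bisect_right(starts, key) - 1
    return prefixes[i]

buckets = {p: [] for p in prefixes}
for key in ["0110", "1011", "0001", "0011"]:
    buckets[find_bucket(key)].append(key)
```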
Page 14 Multi-dimensional Range Query
▐Index: 000*, 001*, 01**, 10**, 11**
▐Scan the index over the linearized query range (0010 - 1001).
▐Filter: reconstruct the boundary info. of each subspace and check whether it intersects the queried area (subspace pruning).
▐Scan only the buckets of the remaining subspaces.
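The index scan and pruning steps can be sketched as follows. The code assumes the queried area is the rectangle with corners 0010 and 1001, that is x in 1..2 and y in 0..1 under the x-first interleaving; `bounds` and `intersects` are hypothetical helper names.

```python
def bounds(prefix):
    """Corners of a prefix-named subspace as integer (x, y) pairs."""
    lo, hi = prefix.replace("*", "0"), prefix.replace("*", "1")
    return ((int(lo[0::2], 2), int(lo[1::2], 2)),
            (int(hi[0::2], 2), int(hi[1::2], 2)))

def intersects(prefix, q_lo, q_hi):
    """True if the subspace rectangle overlaps the query rectangle."""
    (x_lo, y_lo), (x_hi, y_hi) = bounds(prefix)
    return (x_lo <= q_hi[0] and q_lo[0] <= x_hi and
            y_lo <= q_hi[1] and q_lo[1] <= y_hi)

index = ["000*", "001*", "01**", "10**", "11**"]
z_lo, z_hi = "0010", "1001"        # linearized query range
q_lo, q_hi = (1, 0), (2, 1)        # the same corners in (x, y) form

# Step 1: scan the index for subspaces overlapping the linearized range.
candidates = [p for p in index
              if p.replace("*", "1") >= z_lo and p.replace("*", "0") <= z_hi]

# Step 2: prune subspaces whose reconstructed bounds miss the queried area.
to_scan = [p for p in candidates if intersects(p, q_lo, q_hi)]
```

Under these assumptions `01**` is pruned, so the cells 4 to 7 that a plain linearized scan would read are never touched.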
Page 15 K Nearest Neighbors Query
▐The best-first algorithm can be applied; it is the most efficient technique in practical cases.
▐See our paper for details.
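A generic best-first search over subspaces can be sketched as below. This is the textbook priority-queue formulation, not necessarily the exact procedure in the paper; the bucket layout, the sample points, and the names `min_dist2` and `knn` are all assumptions made for illustration.

```python
import heapq

def min_dist2(q, lo, hi):
    """Squared distance from point q to rectangle [lo, hi]; 0 if q is inside."""
    d = 0
    for qi, l, h in zip(q, lo, hi):
        if qi < l:
            d += (l - qi) ** 2
        elif qi > h:
            d += (qi - h) ** 2
    return d

def knn(q, buckets, k):
    """buckets: list of (lo_corner, hi_corner, points). Returns the k nearest points."""
    # Heap entries are (squared distance, kind, item); kind 0 = bucket, 1 = point.
    heap = [(min_dist2(q, lo, hi), 0, i) for i, (lo, hi, _) in enumerate(buckets)]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        _, kind, item = heapq.heappop(heap)
        if kind == 0:
            # Expand the closest unvisited bucket into its points.
            for p in buckets[item][2]:
                d = sum((a - b) ** 2 for a, b in zip(p, q))
                heapq.heappush(heap, (d, 1, p))
        else:
            # A point popped before every remaining bucket is a true neighbor.
            result.append(item)
    return result

# Subspaces 000*, 001*, 01**, 1*** with made-up points.
buckets = [
    ((0, 0), (0, 1), [(0, 0), (0, 1)]),
    ((1, 0), (1, 1), [(1, 1)]),
    ((0, 2), (1, 3), [(0, 3), (1, 2)]),
    ((2, 0), (3, 3), [(2, 1), (3, 3)]),
]
nearest = knn((1, 1), buckets, 4)
```

Buckets whose minimum distance exceeds that of the k-th neighbor are never expanded, which is where the savings come from.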
Page 16 Variations of Storage Layer
▐Table Share Model
  Uses a single table and maintains bucket boundaries.
  Most space-efficient.
▐Table per Bucket Model
  Allocates a table per bucket.
  Most flexible mapping: one-to-one, one-to-many, many-to-one.
  Bucket split is expensive: all points are copied to the new buckets.
▐Region per Bucket Model
  Allocates a region per bucket.
  Most efficient bucket split: asynchronous bucket split.
  Requires modification of HBase.
Page 17 Experimental Results: Multi-dimensional Range Query
▐Dataset: 400,000,000 points
▐Queries: select objects within MD ranges, varying the selectivity
▐Cluster size: 16 nodes
▐MD-HBase responds 10 to 100 times faster than the others, with response time proportional to selectivity.
Page 18 Experimental Results: k Nearest Neighbors Query
▐Dataset: 400,000,000 points
▐Queries: choose a point and vary the number of neighbors
▐Cluster size: 16 nodes
▐MD-HBase responds in 1.5 sec when k ≦ 100, and in 11 sec even when k = 10,000.
Page 19 Experimental Results: Insert
▐Dataset: spatially skewed data generated from a Zipfian distribution
▐MD-HBase shows good scalability without significant overhead.
Page 20 Conclusions
▐Designed a scalable multi-dimensional data store.
  Scalability & efficient multi-dimensional queries
▐Key Idea: indexing the longest common prefix of keys
  Easily extends general ordered key-value stores.
▐Demonstrated scalable insert throughput and excellent query performance.
  Range Query: 10-100 times faster than existing technologies.
  kNN Query: 1.5 s when k ≦ 100.
  Insert: 220K inserts/sec on a 16-node cluster without overhead.
Thank you. Any Questions?