Page 1 MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services Shoji Nishimura (NEC Service Platforms Labs.), Sudipto Das, Divyakant Agrawal, Amr El Abbadi (University of California, Santa Barbara) * Work done as a visiting researcher at UCSB
Page 2 Overview ▐A Motivating Story ▐Existing Technologies ▐Our proposal ▐Evaluation ▐Conclusion
Page 3 Motivating Scenario: Mobile Coupon Distribution [Figure: users' current locations, a mobile coupon distributor, coupons, and a distribution policy (area, # of coupons)]
Page 4 Motivating Scenario: Mobile Coupon Distribution [Figure: many users' current locations, coupons, and a distribution policy (area, # of coupons)] ▐Large amounts of data, high throughput: system scalability ▐Multi-dimensional query, nearest neighbors query: efficient complex queries ▐125,000,000 subscribers in Japan
Page 5 Existing Technologies [Chart axes: Multi-dimensional Queries, Scalability] Relational DBs; Spatial DBs; commercial products, but expensive; open source products; Key-Value Stores; what we want, at a reasonable price
Page 6 Ordered Key-Value Stores [Figure: an index (key00, key11, …, keynn) over buckets of key-value pairs, sorted by key] Good at 1-D range queries. [Figure: axes Longitude, Latitude, Time] But our target is multi-dimensional…
Page 7 Naïve Solution: Linearization [Figure: index and buckets of an ordered key-value store] Projects n-D space to 1-D space. Simple, but problematic… Apply a Z-ordering curve…
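The linearization step can be illustrated with a minimal sketch, assuming a 2-D grid with 2 bits per dimension and an interleaving order that puts the x bit first. The function names `z_encode` and `z_decode` are hypothetical and not taken from MD-HBase.

```python
def z_encode(x: int, y: int, bits: int = 2) -> int:
    """Interleave the bits of x and y (x bit first) into one Z-order key."""
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((x >> i) & 1)  # next-highest bit of x
        key = (key << 1) | ((y >> i) & 1)  # next-highest bit of y
    return key


def z_decode(key: int, bits: int = 2) -> tuple[int, int]:
    """Undo z_encode: split the interleaved key back into (x, y)."""
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((key >> (2 * i + 1)) & 1)
        y = (y << 1) | ((key >> (2 * i)) & 1)
    return x, y


if __name__ == "__main__":
    # Print the 4x4 grid of keys, top row = largest y.
    for y in reversed(range(4)):
        print(" ".join(f"{z_encode(x, y):04b}" for x in range(4)))
```

Points that are close in the grid often, but not always, receive nearby keys, which is why a 1-D range over these keys can cover cells outside a queried rectangle.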
Page 8 Problem: False positive scans ▐MD-query on linearized space: Translate an MD-query to a linearized range query (ex. query from 2 to 9). Scan the queried linearized range. Filter out points outside the queried area (ex. blue-hatched area, 4 to 7). Requires the boundary information of the original space.
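The 2-to-9 example can be reproduced with a small sketch, assuming the same 4x4 grid and x-first bit interleaving; the query rectangle x in [1, 2], y in [0, 1] is an assumption chosen because it yields exactly that key range. The helper names are hypothetical.

```python
def z_encode(x: int, y: int, bits: int = 2) -> int:
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
    return key


def z_decode(key: int, bits: int = 2) -> tuple[int, int]:
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((key >> (2 * i + 1)) & 1)
        y = (y << 1) | ((key >> (2 * i)) & 1)
    return x, y


def scan_with_filter(q_lo, q_hi, bits: int = 2):
    """Scan the linearized range of a rectangle and separate real hits
    from false positives (cells scanned but outside the rectangle)."""
    lo_key = z_encode(*q_lo, bits)   # smallest key in the rectangle
    hi_key = z_encode(*q_hi, bits)   # largest key in the rectangle
    hits, false_positives = [], []
    for key in range(lo_key, hi_key + 1):
        x, y = z_decode(key, bits)
        inside = q_lo[0] <= x <= q_hi[0] and q_lo[1] <= y <= q_hi[1]
        (hits if inside else false_positives).append(key)
    return hits, false_positives


hits, false_positives = scan_with_filter((1, 0), (2, 1))
print(hits)             # cells inside the queried area
print(false_positives)  # cells scanned and then filtered out
```

Under these assumptions half of the scanned cells are discarded, which is the cost the slide calls false positive scans.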
Page 9 Our Approach: MD-HBase Build a multi-dimensional index layer on top of an ordered key-value store. [Figure: MD-HBase places a multi-dimensional index above the single-dimensional index of an ordered key-value store, ex. BigTable, HBase, …]
Page 10 Introduce Multi-dimensional Index ▐Multi-dimensional index (ex. the K-d tree, the Quad tree): Divide a space into subspaces containing almost the same # of points. Organize the subspaces as a tree. Efficient subspace pruning → avoids false positive scans.
Page 11 Space Partition by the K-d tree [Figure: binary Z-ordering space (keys built by bitwise interleaving) and the space partitioned by the K-d tree] How do we represent these subspaces?
Page 12 Key Idea: The longest common prefix naming scheme Subspaces are represented as the longest common prefix of their keys (ex. *, 1***). Remarkable property: preserves the boundary information of the original space. Ex. 1***: left-bottom corner (*→0) = (10, 00); right-top corner (*→1) = (11, 11).
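The boundary reconstruction can be sketched as follows, assuming 2 bits per dimension, x-first interleaving, and prefixes whose `*` characters are all trailing. `prefix_bounds` is a hypothetical helper name.

```python
def z_decode(key: int, bits: int = 2) -> tuple[int, int]:
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((key >> (2 * i + 1)) & 1)
        y = (y << 1) | ((key >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix: str, bits: int = 2):
    """Return (left-bottom, right-top) corners of the subspace named by
    a prefix such as '1***': replace every * with 0, then with 1."""
    stem = prefix.rstrip("*")
    total = 2 * bits
    lo_key = int(stem.ljust(total, "0"), 2)  # * -> 0
    hi_key = int(stem.ljust(total, "1"), 2)  # * -> 1
    return z_decode(lo_key, bits), z_decode(hi_key, bits)


lo, hi = prefix_bounds("1***")
print(lo, hi)  # corners as (x, y) integers
```

For `1***` this gives corners (2, 0) and (3, 3), which are (10, 00) and (11, 11) in binary, so the name of a subspace alone is enough to recover its extent.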
Page 13 Build an index with the longest common prefix of keys [Figure: index entries 000*, 001*, 01**, 1***] Buckets are allocated per subspace.
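One way to picture this index is a sorted list of prefixes searched by binary search. The sketch below is an illustration under that assumption, not the MD-HBase implementation; `find_bucket` and `split` are hypothetical names, and keys are 4-bit strings with the x bit first.

```python
from bisect import bisect_right


def z_key_bits(x: int, y: int, bits: int = 2) -> str:
    """Z-order key of (x, y) as a bit string, x bit first."""
    return "".join(f"{(x >> i) & 1}{(y >> i) & 1}" for i in reversed(range(bits)))


def find_bucket(index: list[str], key_bits: str) -> str:
    """Locate the prefix whose subspace contains key_bits.
    `index` must be sorted and must cover the whole key space."""
    lows = [p.rstrip("*").ljust(len(key_bits), "0") for p in index]
    return index[bisect_right(lows, key_bits) - 1]


def split(prefix: str) -> tuple[str, str]:
    """Split a subspace in two by fixing its first * to 0 and to 1.
    Assumes the prefix still has at least one trailing *."""
    stem = prefix.rstrip("*")
    stars = len(prefix) - len(stem) - 1
    return stem + "0" + "*" * stars, stem + "1" + "*" * stars


index = ["000*", "001*", "01**", "1***"]
print(find_bucket(index, z_key_bits(2, 1)))
print(split("1***"))
```

Because equal-length bit strings sort the same way as the numbers they encode, the ordered key-value store can keep these entries in key order without any extra machinery.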
Page 14 Multi-dimensional Range Query [Figure: scan on the index (entries 000*, 001*, 01**, 10**, 11**), filter by subspace pruning, then scan the surviving buckets] Subspace pruning: reconstruct the boundary information of each subspace and check whether it intersects the queried area.
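The pruning step can be sketched as a rectangle-intersection test over the index entries. This is a minimal illustration assuming 2 bits per dimension, x-first interleaving, and a query rectangle x in [1, 2], y in [0, 1] picked for the example; `prune` and `intersects` are hypothetical names.

```python
def z_decode(key: int, bits: int = 2) -> tuple[int, int]:
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((key >> (2 * i + 1)) & 1)
        y = (y << 1) | ((key >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix: str, bits: int = 2):
    """Corners of a subspace: trailing * set to 0 (low) and 1 (high)."""
    stem = prefix.rstrip("*")
    total = 2 * bits
    return (z_decode(int(stem.ljust(total, "0"), 2), bits),
            z_decode(int(stem.ljust(total, "1"), 2), bits))


def intersects(lo, hi, q_lo, q_hi) -> bool:
    """True if box [lo, hi] overlaps the query box in every dimension."""
    return all(lo[d] <= q_hi[d] and q_lo[d] <= hi[d] for d in range(2))


def prune(index: list[str], q_lo, q_hi, bits: int = 2) -> list[str]:
    """Keep only the subspaces that can contain query results."""
    return [p for p in index
            if intersects(*prefix_bounds(p, bits), q_lo, q_hi)]


index = ["000*", "001*", "01**", "10**", "11**"]
print(prune(index, (1, 0), (2, 1)))
```

Only the surviving buckets are scanned, so cells such as those under `01**` are never read even though their keys fall inside the linearized range.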
Page 15 K Nearest Neighbors Query ▐The best-first algorithm can be applied: the most efficient technique in practical cases. ▐See our paper for details.
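For orientation, a generic best-first search over prefix-named buckets might look like the sketch below. It is a textbook-style illustration, not the algorithm from the paper: the in-memory `buckets` dictionary, its sample points, and all function names are assumptions made for the example.

```python
import heapq


def z_decode(key: int, bits: int = 2) -> tuple[int, int]:
    x = y = 0
    for i in reversed(range(bits)):
        x = (x << 1) | ((key >> (2 * i + 1)) & 1)
        y = (y << 1) | ((key >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix: str, bits: int = 2):
    stem = prefix.rstrip("*")
    total = 2 * bits
    return (z_decode(int(stem.ljust(total, "0"), 2), bits),
            z_decode(int(stem.ljust(total, "1"), 2), bits))


def min_dist2(q, lo, hi) -> int:
    """Squared distance from q to the nearest point of box [lo, hi]."""
    return sum(max(lo[d] - q[d], 0, q[d] - hi[d]) ** 2 for d in range(2))


def knn(buckets: dict, q, k: int, bits: int = 2) -> list:
    """Best-first search: always expand the closest unexplored item.
    Heap entries are (squared distance, kind, item); kind 0 = bucket,
    kind 1 = point, so a bucket is opened before a point at equal distance."""
    heap = [(min_dist2(q, *prefix_bounds(p, bits)), 0, p) for p in buckets]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        _, kind, item = heapq.heappop(heap)
        if kind == 0:  # open the bucket and enqueue its points
            for pt in buckets[item]:
                d2 = (pt[0] - q[0]) ** 2 + (pt[1] - q[1]) ** 2
                heapq.heappush(heap, (d2, 1, pt))
        else:          # a point popped here is the next nearest
            result.append(item)
    return result


# Hypothetical sample data: points grouped by the subspace containing them.
buckets = {
    "000*": [(0, 0), (0, 1)], "001*": [(1, 1)], "01**": [(0, 3), (1, 2)],
    "10**": [(2, 0), (3, 1)], "11**": [(3, 3)],
}
print(knn(buckets, (2, 1), 3))
```

Buckets whose minimum distance exceeds that of the k-th result are never opened, which is the source of the efficiency the slide refers to.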
Page 16 Variations of Storage Layer ▐Table Share Model: use a single table and maintain bucket boundaries; most space efficient. ▐Table per Bucket Model: allocate a table per bucket; most flexible mapping (one-to-one, one-to-many, many-to-one); bucket split is expensive (copies all points to the new buckets). ▐Region per Bucket Model: allocate a region per bucket; most efficient bucket split (asynchronous bucket split); requires modification of HBase.
Page 17 Experimental Results: Multi-dimensional Range Query Dataset: 400,000,000 points. Queries: select objects within MD ranges while varying selectivity. Cluster size: 16 nodes. MD-HBase responds 10~100 times faster than the others, and its response time is proportional to selectivity.
Page 18 Experimental Results: k Nearest Neighbors Query Dataset: 400,000,000 points. Queries: choose a point and vary the number of neighbors. Cluster size: 16 nodes. MD-HBase responds in 1.5 sec when k ≦ 100, and in 11 sec even when k = 10,000.
Page 19 Experimental Results: Insert Dataset: spatially skewed data generated by a Zipfian distribution. MD-HBase shows good scalability without significant overhead.
Page 20 Conclusions ▐Designed a scalable multi-dimensional data store: scalability & efficient multi-dimensional queries. Key idea: indexing the longest common prefix of keys. Easily extends general ordered key-value stores. ▐Demonstrated scalable insert throughput and excellent query performance. Range query: 10~100 times faster than existing technologies. kNN query: 1.5 s when k ≦ 100. Insert: 220K inserts/sec on a 16-node cluster without significant overhead. Thank you. Any questions?