MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services

Presentation transcript:

Page 1 MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services Shoji Nishimura (NEC Service Platforms Labs.), Sudipto Das, Divyakant Agrawal, Amr El Abbadi (University of California, Santa Barbara) * Work done as a visiting researcher at UCSB

Page 2 Overview ▐A Motivating Story ▐Existing Technologies ▐Our proposal ▐Evaluation ▐Conclusion

Page 3 Motivating Scenario: Mobile Coupon Distribution
[Figure labels: Current Location (several users); Coupon; Distribution Policy (Area, # of coupons); Mobile Coupon Distributor]

Page 4 Motivating Scenario: Mobile Coupon Distribution
[Figure labels: Current Location (many users); Coupon; Distribution Policy (Area, # of coupons)]
▐Large amounts of data
▐High throughput
▐System scalability
▐Multi-dimensional query
▐Nearest neighbors query
▐Efficient complex queries
125,000,000 subscribers in Japan

Page 5 Existing Technologies
[Chart axes: Multi-dimensional Queries vs. Scalability]
▐Relational DBs
▐Spatial DBs: commercial products (but expensive), open source products
▐Key-Value Stores
▐What we want: both, at a reasonable price

Page 6 Ordered Key-Value Stores
[Figure: an index (key00, key11, keynn) points to buckets of key-value pairs (key00 to key0X, key11 to key1Y, keynn), sorted by key]
▐Good at 1-D range queries
▐But our target is multi-dimensional (longitude, latitude, time)…

Page 7 Naïve Solution: Linearization
[Figure: the same index and key-value buckets as on Page 6]
▐Projects the n-D space onto a 1-D space
▐Apply a Z-ordering curve…
▐Simple, but problematic…
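A minimal sketch of linearization by Z-ordering in two dimensions. The function names and the bit order (first coordinate's bit ahead of the second's, most significant bit first) are assumptions chosen to match the `(10, 00)` / `(11, 11)` corner example on Page 12; the slides do not spell out an implementation.

```python
def interleave(x, y, bits=2):
    """Map a 2-D cell (x, y) to its 1-D Z-order key by bitwise interleaving."""
    z = 0
    for i in range(bits - 1, -1, -1):      # most significant bit first
        z = (z << 1) | ((x >> i) & 1)      # bit of the first coordinate
        z = (z << 1) | ((y >> i) & 1)      # bit of the second coordinate
    return z


def deinterleave(z, bits=2):
    """Recover the 2-D cell (x, y) from a Z-order key."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y
```

With `bits=2` this covers a 4 x 4 grid with keys 0 to 15; for example, cell `(2, 1)` (binary `10`, `01`) becomes key `1001`, which is 9.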

Page 8 Problem: False positive scans
▐MD-query on the linearized space
 1. Translate the MD-query into a linearized range query (e.g., a query from 2 to 9).
 2. Scan the queried linearized range.
 3. Filter out points that fall outside the queried area (e.g., the blue-hatched area, 4 to 7).
▐Filtering requires the boundary information of the original space.
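The false positive problem can be reproduced on a 4 x 4 grid. The sketch below is an illustration only: the query rectangle (x in 1..2, y in 0..1) is an assumption picked because, under the bit order assumed here, it linearizes to the range 2 to 9 quoted on the slide, and a sorted Python list stands in for the ordered key-value store.

```python
import bisect


def interleave(x, y, bits=2):
    """Z-order key: bits of x and y interleaved, x bit first."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z


def linearized_range_query(points, x_lo, x_hi, y_lo, y_hi, bits=2):
    """Scan the 1-D key range covering the rectangle, then filter."""
    # Stand-in for an ordered key-value store: (key, value) pairs sorted by key.
    store = sorted((interleave(x, y, bits), (x, y)) for x, y in points)
    keys = [z for z, _ in store]
    # The rectangle's corners give the smallest and largest keys inside it.
    z_lo = interleave(x_lo, y_lo, bits)
    z_hi = interleave(x_hi, y_hi, bits)
    start = bisect.bisect_left(keys, z_lo)
    stop = bisect.bisect_right(keys, z_hi)
    hits, false_positives = [], []
    for z, (x, y) in store[start:stop]:
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            hits.append((x, y))
        else:
            false_positives.append(z)   # scanned, but outside the rectangle
    return hits, false_positives


grid = [(x, y) for x in range(4) for y in range(4)]
hits, false_positives = linearized_range_query(grid, 1, 2, 0, 1)
```

Here the scan touches keys 2 through 9, but keys 4 to 7 lie outside the rectangle, so half of the scanned entries are discarded by the filter.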

Page 9 Our Approach: MD-HBase
▐Build a multi-dimensional index layer on top of an ordered key-value store.
[Figure: MD-HBase adds a multi-dimensional index on top of the single-dimensional index of an ordered key-value store (e.g., BigTable, HBase, …)]

Page 10 Introduce Multi-dimensional Index
▐Multi-dimensional index (e.g., the K-d tree, the Quad tree)
 Divide a space into subspaces containing almost the same # of points.
 Organize the subspaces as a tree.
 Efficient subspace pruning → avoids false positive scans.
[Figure labels: Divide into; Organize as]

Page 11 Space Partition by the K-d tree
[Figure: binary Z-ordering space (bitwise interleaving) → space partitioned by the K-d tree]
How do we represent these subspaces?

Page 12 Key Idea: The longest common prefix naming scheme
▐Subspaces are represented as the longest common prefix of their keys (e.g., 1***).
▐Remarkable property: the name preserves the boundary information of the original space.
 Example for 1***: left-bottom corner (* → 0) is (10, 00); right-top corner (* → 1) is (11, 11).
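The boundary-preserving property can be checked with a few lines of code. This is a sketch under assumptions: keys are 4-bit strings, the first bit belongs to the first coordinate, and the helper names are hypothetical. Naming a subspace from a list of keys is shown only to illustrate the notation; it is not claimed to be how MD-HBase derives names.

```python
def deinterleave(z, bits=2):
    """Split a Z-order key into its two coordinates (x bit first)."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def common_prefix_name(keys):
    """Name a set of equal-width binary keys by their longest common prefix."""
    width = len(keys[0])
    n = 0
    while n < width and all(k[n] == keys[0][n] for k in keys):
        n += 1
    return keys[0][:n] + "*" * (width - n)


def prefix_bounds(prefix, bits=2):
    """Rebuild the subspace corners: '*' -> 0 gives left-bottom, '*' -> 1 right-top."""
    low = int(prefix.replace("*", "0"), 2)
    high = int(prefix.replace("*", "1"), 2)
    return deinterleave(low, bits), deinterleave(high, bits)


name = common_prefix_name(["1000", "1011", "1110"])
corners = prefix_bounds(name)
```

For `1***` the corners come out as `(2, 0)` and `(3, 3)`, that is `(10, 00)` and `(11, 11)` in binary, matching the slide.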

Page 13 Build an index with the longest common prefix of keys
[Figure: index entries 000*, 001*, 01**, 1*** point to buckets 000*, 001*, 01**, 1***]
▐Buckets are allocated per subspace.
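One way to picture such an index is a sorted list of prefixes, where the bucket for a key is found by a floor search on each prefix's smallest key. The sketch below is a hypothetical in-memory model, not the MD-HBase storage layout; the split rule (fix the next free bit to 0 or 1) is an assumption consistent with the 1*** → 10**, 11** entries that appear on Page 14.

```python
import bisect


def low_key(prefix):
    """Smallest key in the subspace: every '*' replaced by 0."""
    return int(prefix.replace("*", "0"), 2)


def find_bucket(index, z):
    """Return the prefix whose subspace contains key z.

    Assumes `index` is sorted by low_key and its prefixes partition the key space.
    """
    lows = [low_key(p) for p in index]
    return index[bisect.bisect_right(lows, z) - 1]


def split_bucket(index, prefix):
    """Replace `prefix` by its two halves, fixing the first '*' to 0 and to 1."""
    i = prefix.index("*")
    left = prefix[:i] + "0" + prefix[i + 1:]
    right = prefix[:i] + "1" + prefix[i + 1:]
    pos = index.index(prefix)
    return index[:pos] + [left, right] + index[pos + 1:]


index = ["000*", "001*", "01**", "1***"]
before = find_bucket(index, 0b1101)
index = split_bucket(index, "1***")
after = find_bucket(index, 0b1101)
```

Because the prefixes are disjoint and cover the whole key space, sorting by smallest key is enough for the floor search to land in the right bucket.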

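Page 14 Multi-dimensional Range Query
 1. Scan on the index.
 2. Filter (subspace pruning): reconstruct the boundary info of each subspace and check whether it intersects the queried area.
 3. Scan the remaining buckets.
[Figure: index entries 000*, 001*, 01**, 10**, 11** pass through the filter; only the intersecting buckets are scanned]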
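The three steps can be sketched as follows. The bucket contents, the query rectangle (x in 1..2, y in 0..1) and all function names are assumptions for illustration; a dictionary of lists stands in for the buckets in the key-value store.

```python
def deinterleave(z, bits=2):
    """Split a Z-order key into its two coordinates (x bit first)."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix, bits=2):
    """Subspace corners rebuilt from the prefix name alone."""
    low = int(prefix.replace("*", "0"), 2)
    high = int(prefix.replace("*", "1"), 2)
    return deinterleave(low, bits), deinterleave(high, bits)


def build_buckets(prefixes, bits=2):
    """Place every grid cell into the bucket whose prefix matches its key."""
    buckets = {p: [] for p in prefixes}
    for z in range(1 << (2 * bits)):
        key = format(z, "0{}b".format(2 * bits))
        for p in prefixes:
            if all(c in ("*", k) for c, k in zip(p, key)):
                buckets[p].append(deinterleave(z, bits))
    return buckets


def range_query(buckets, x_lo, x_hi, y_lo, y_hi, bits=2):
    """Prune subspaces by their bounds, then scan only the surviving buckets."""
    scanned, hits = [], []
    for prefix in sorted(buckets):                       # 1. scan on the index
        (bx_lo, by_lo), (bx_hi, by_hi) = prefix_bounds(prefix, bits)
        if bx_hi < x_lo or x_hi < bx_lo or by_hi < y_lo or y_hi < by_lo:
            continue                                     # 2. pruned: no overlap
        scanned.append(prefix)
        for x, y in buckets[prefix]:                     # 3. scan the bucket
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
                hits.append((x, y))
    return scanned, hits


buckets = build_buckets(["000*", "001*", "01**", "10**", "11**"])
scanned, hits = range_query(buckets, 1, 2, 0, 1)
```

For this rectangle only `001*` and `10**` survive pruning, so six cells are scanned instead of the eight touched by the plain linearized scan over keys 2 to 9.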

Page 15 K Nearest Neighbors Query
▐The best-first algorithm can be applied; it is the most efficient technique in practice.
▐See our paper for the details.
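The slide gives no detail, so the following is a generic best-first nearest-neighbor sketch over prefix-named subspaces, not necessarily the procedure described in the paper. It assumes squared Euclidean distance on grid cells, an in-memory dictionary of buckets, and hypothetical helper names.

```python
import heapq


def deinterleave(z, bits=2):
    """Split a Z-order key into its two coordinates (x bit first)."""
    x = y = 0
    for i in range(bits - 1, -1, -1):
        x = (x << 1) | ((z >> (2 * i + 1)) & 1)
        y = (y << 1) | ((z >> (2 * i)) & 1)
    return x, y


def prefix_bounds(prefix, bits=2):
    low = int(prefix.replace("*", "0"), 2)
    high = int(prefix.replace("*", "1"), 2)
    return deinterleave(low, bits), deinterleave(high, bits)


def build_buckets(prefixes, bits=2):
    """Place every grid cell into the bucket whose prefix matches its key."""
    buckets = {p: [] for p in prefixes}
    for z in range(1 << (2 * bits)):
        key = format(z, "0{}b".format(2 * bits))
        for p in prefixes:
            if all(c in ("*", k) for c, k in zip(p, key)):
                buckets[p].append(deinterleave(z, bits))
    return buckets


def knn(buckets, qx, qy, k, bits=2):
    """Best-first search: always expand the closest unvisited subspace or point."""
    heap = []
    for prefix in buckets:
        (x0, y0), (x1, y1) = prefix_bounds(prefix, bits)
        dx = max(x0 - qx, 0, qx - x1)        # 0 if qx lies inside [x0, x1]
        dy = max(y0 - qy, 0, qy - y1)
        # kind 0 = subspace, kind 1 = point; subspaces win ties.
        heapq.heappush(heap, (dx * dx + dy * dy, 0, prefix))
    result = []
    while heap and len(result) < k:
        _, kind, item = heapq.heappop(heap)
        if kind == 0:                        # open the bucket, queue its points
            for x, y in buckets[item]:
                heapq.heappush(heap, ((x - qx) ** 2 + (y - qy) ** 2, 1, (x, y)))
        else:                                # a point popped here is the next nearest
            result.append(item)
    return result


buckets = build_buckets(["000*", "001*", "01**", "10**", "11**"])
nearest = knn(buckets, 0, 0, 3)
```

A bucket is opened only when its minimum possible distance is no larger than that of every queued point, so distant subspaces such as `11**` are never scanned for small k.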

Page 16 Variations of Storage Layer
▐Table Share Model
 Use a single table and maintain bucket boundaries.
 Most space efficient.
▐Table per Bucket Model
 Allocate a table per bucket.
 Most flexible mapping: one-to-one, one-to-many, many-to-one.
 Bucket split is expensive: all points are copied to the new buckets.
▐Region per Bucket Model
 Allocate a region per bucket.
 Most efficient bucket split (asynchronous bucket split).
 Requires modification of HBase.

Page 17 Experimental Results: Multi-dimensional Range Query
Dataset: 400,000,000 points
Queries: select objects within MD ranges, varying the selectivity
Cluster size: 16 nodes
MD-HBase responds 10 to 100 times faster than the other approaches, and its response time is proportional to the selectivity.

Page 18 Experimental Results: k Nearest Neighbors Query
Dataset: 400,000,000 points
Queries: choose a point and vary the number of neighbors
Cluster size: 16 nodes
MD-HBase responds in 1.5 s when k ≦ 100, and in 11 s even when k = 10,000.

Page 19 Experimental Results: Insert
Dataset: spatially skewed data generated from a Zipfian distribution
MD-HBase shows good scalability without significant overhead.

Page 20 Conclusions
▐Designed a scalable multi-dimensional data store.
 Scalability and efficient multi-dimensional queries.
▐Key idea: indexing the longest common prefix of keys.
 Easily extends general ordered key-value stores.
▐Demonstrated scalable insert throughput and excellent query performance.
 Range query: 10 to 100 times faster than existing technologies (as reported on Page 17).
 kNN query: 1.5 s when k ≦ 100.
 Insert: 220K inserts/sec on a 16-node cluster without overhead.
Thank you. Any questions?