Elasticity in SciDB DBMS
Team Members: Gunjan Sharma (MT15015), Hiya Popli (MT15020)
An Introduction to SciDB DBMS
SciDB is an array-based parallel DBMS oriented toward science applications. The data in such a DBMS has the following characteristics:
Array Data Model: Most science data (earth science, astronomy telescope data, etc.), as well as most of the analytics that scientists run, are fundamentally array oriented and do not fit into the relational data model.
Sparse or Dense Arrays: Some arrays have a value in every cell (like cooked satellite images), whereas in other cases (like raw satellite imagery) the data is really sparse.
Skewed Data: It is very common in science applications for some regions of array space to hold substantially more data than others, for example when storing resident data for a region.
Visualization Focus: Scientists usually want a visualization system through which they can browse and inspect substantial amounts of data of interest.
Elasticity
A science DBMS should support both data elasticity and processing elasticity without extensive downtime, i.e., expansion is accomplished in the background.
Why? The model of elastic behaviour has three phases: a loading phase in which additional data is ingested, followed by a possible reorganization phase, followed by a query phase in which users study the data. These phases repeat indefinitely, and the job of an elasticity system is threefold:
1. Predict when resources will be exhausted.
2. Take corrective action to add another quantum of storage and processing.
3. Reorganize the database onto the extra node(s) to optimize future processing of the query load.
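The first two jobs of the elasticity system can be illustrated with a minimal sketch. It assumes a simple linear growth forecast and fixed-size nodes; the function name `nodes_to_add` and all of its parameters are hypothetical and are not taken from SciDB.

```python
import math


def nodes_to_add(used_gb, growth_gb_per_cycle, node_capacity_gb, nodes,
                 cycles_ahead=1):
    """Return how many nodes to add so the projected data still fits.

    Assumption: storage grows linearly by `growth_gb_per_cycle` in each
    load/reorganize/query cycle, and every node holds `node_capacity_gb`.
    """
    # Step 1: predict storage demand a few cycles ahead.
    projected_gb = used_gb + growth_gb_per_cycle * cycles_ahead
    capacity_gb = nodes * node_capacity_gb
    if projected_gb <= capacity_gb:
        return 0  # resources will not be exhausted yet
    # Step 2: add whole nodes (one quantum each) to cover the shortfall.
    shortfall_gb = projected_gb - capacity_gb
    return math.ceil(shortfall_gb / node_capacity_gb)
```

For example, `nodes_to_add(90, 30, 50, 2)` projects 120 GB against 100 GB of capacity and asks for one more node. The third job, reorganizing data onto the new node, is what the elastic partitioners handle.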
Elastic Array Partitioning
Elastic array partitioners are designed to incrementally reorganize an array’s storage, moving only the data necessary to rebalance the storage load.
Hash Partitioning
Hash partitioning is well suited to fine-grained storage partitioning because it places chunks one at a time, rather than having to subdivide planes in array space. Hence, equi-joins and most “embarrassingly parallel” operations are best served by hash partitioning. There are two basic approaches to elastic hash partitioning:
1. Extendible Hash: This is optimized for skewed data. The algorithm begins with a set of hash buckets, one per node. When the cluster increases in size, the partitioner splits the hash buckets of the most heavily loaded hosts, redistributing part of their contents to the new nodes.
2. Consistent Hash: This is optimized for data that is evenly distributed throughout an array. It is a hash map distributed around the circumference of a circle: both nodes and chunks are hashed to an integer, which designates their position on the circle’s edge.
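The consistent hash circle can be sketched as follows. This is a minimal illustration, not SciDB's implementation: it assumes MD5 as the hash function and assumes that each chunk is stored on the first node found clockwise from the chunk's position. The class name `ConsistentHashRing` and the node and chunk identifiers are hypothetical.

```python
import bisect
import hashlib


def ring_position(key):
    """Hash a node name or chunk id to an integer position on the circle."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._positions = []  # sorted node positions on the circle
        self._owner = {}      # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node

    def node_for(self, chunk_id):
        """Return the first node clockwise from the chunk's position."""
        pos = ring_position(chunk_id)
        # Wrap around to the first node when the chunk hashes past the last.
        i = bisect.bisect_right(self._positions, pos) % len(self._positions)
        return self._owner[self._positions[i]]


ring = ConsistentHashRing(["node0", "node1", "node2"])
chunks = ["chunk-%d" % i for i in range(200)]
before = {c: ring.node_for(c) for c in chunks}
ring.add_node("node3")  # cluster expansion
after = {c: ring.node_for(c) for c in chunks}
moved = [c for c in chunks if before[c] != after[c]]
```

Under these assumptions, every chunk in `moved` lands on `node3`. The existing nodes never exchange chunks with each other, which is what makes the reorganization incremental.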
Range Partitioning
Range partitioning has the best performance for queries that have clustered data access. There are three strategies for clustered data partitioning:
K-d Tree: An efficient strategy for range partitioning skewed, multidimensional data. The K-d Tree stores its partitioning table as a binary tree.
Uniform Range: This partitioner is optimized for unskewed arrays. It requires a complicated global reorganization at every cluster expansion to keep the partitions balanced.
Append: This partitioner adjusts its layout based on storage size rather than logical chunk count, and has minimal overhead for data reorganization.
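A K-d Tree partitioning table stored as a binary tree might look like the sketch below. It assumes that a cluster expansion splits one host's region at the median chunk coordinate along a chosen dimension, with the upper half going to the new host. The split rule, the names `KdNode`, `lookup`, and `split_leaf`, and the sample coordinates are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KdNode:
    host: Optional[str] = None        # set on leaves only
    dim: int = 0                      # dimension tested at an inner node
    split: int = 0                    # split coordinate at an inner node
    left: Optional["KdNode"] = None   # region with coord < split
    right: Optional["KdNode"] = None  # region with coord >= split


def lookup(tree, coords):
    """Walk the binary tree to find the host that stores a chunk."""
    node = tree
    while node.host is None:
        node = node.left if coords[node.dim] < node.split else node.right
    return node.host


def split_leaf(leaf, chunk_coords, dim, new_host):
    """Split a leaf at the median chunk coordinate along `dim`.

    The lower half stays on the original host; the upper half moves
    to `new_host`. Only this leaf's chunks are touched.
    """
    values = sorted(c[dim] for c in chunk_coords)
    median = values[len(values) // 2]
    leaf.left = KdNode(host=leaf.host)
    leaf.right = KdNode(host=new_host)
    leaf.dim, leaf.split, leaf.host = dim, median, None


tree = KdNode(host="node0")
chunk_coords = [(0, 0), (1, 5), (2, 1), (3, 7)]
split_leaf(tree, chunk_coords, dim=0, new_host="node1")
placement = {c: lookup(tree, c) for c in chunk_coords}
```

Here the split falls at coordinate 2 on dimension 0, so `(0, 0)` and `(1, 5)` stay on `node0` while `(2, 1)` and `(3, 7)` move to `node1`. Because the split follows the data rather than the array bounds, skewed regions end up with smaller ranges.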
Elastic Partitioner Results and Conclusion
Cost of redistribution: Append is the clear winner in this space, as it does not rebalance the data. K-d Tree and the hash partitioners also perform well. Uniform Range globally redistributes the data and hence has a higher time requirement.
Load balancing: Skew strongly influences the performance of our range partitioners. Append exhibits poor load balancing overall. Consistent Hash and Extendible Hash do best because they subdivide the data at its finest granularity, by its chunks.
For data loading and reorganization, the Append approach is fastest, but this speed comes at a cost when the database executes queries over imbalanced storage. Overall, K-d Tree is the most effective partitioner for our array workloads.
THANK YOU