Elasticity in SciDB DBMS
Team Members: Gunjan Sharma (MT15015), Hiya Popli (MT15020)


An Introduction to SciDB DBMS
SciDB is an array-based parallel DBMS oriented toward science applications. The data in such a DBMS has the following characteristics:
- Array Data Model: Most science data (earth science data, astronomy telescope data, etc.), as well as most of the analytics that scientists run, are fundamentally array-oriented and do not fit into the relational data model.
- Sparse or Dense Arrays: Some arrays have a value in every cell (like cooked satellite images), whereas others (like raw satellite imagery) are very sparse.
- Skewed Data: It is very common in science applications for some regions of array space to hold substantially more data than others, for example when storing resident data for a region.
- Visualization Focus: Scientists usually want a visualization system through which they can browse and inspect substantial amounts of data of interest.

Elasticity
- A science DBMS should support both data elasticity and processing elasticity without extensive downtime, i.e. expansion is accomplished in the background.
- The model of elastic behaviour has three phases: a loading phase in which additional data is ingested, followed by a possible reorganization phase, followed by a query phase in which users study the data. These phases repeat indefinitely, and the job of an elasticity system is threefold:
1. Predict when resources will be exhausted.
2. Take corrective action to add another quantum of storage and processing.
3. Reorganize the database onto the extra node(s) to optimize future processing of the query load.
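The first two steps above can be illustrated with a minimal sketch. It assumes a very simple capacity model that the slides do not specify: every node offers the same storage, and the size of the next load phase is known in advance. The function name and all parameters are hypothetical.

```python
import math


def nodes_to_add(used_gb, capacity_per_node_gb, num_nodes, next_load_gb):
    """Return how many nodes to add so that the next load phase fits.

    Assumption (not from the slides): uniform node capacity and a known
    size for the upcoming load phase.
    """
    projected = used_gb + next_load_gb
    total_capacity = capacity_per_node_gb * num_nodes
    if projected <= total_capacity:
        return 0  # resources are not yet exhausted
    shortfall = projected - total_capacity
    # Add whole nodes ("quanta" of storage and processing).
    return math.ceil(shortfall / capacity_per_node_gb)
```

Step 3, reorganizing data onto the new nodes, is the job of the elastic partitioners described on the following slides.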

Elastic Array Partitioning
- Elastic array partitioners are designed to incrementally reorganize an array's storage, moving only the data necessary to rebalance storage load.
Hash Partitioning
- Hash partitioning is well suited to fine-grained storage partitioning because it places chunks one at a time, rather than having to subdivide planes in array space. Hence, equi-joins and most "embarrassingly parallel" operations are best served by hash partitioning.
- There are two basic approaches to elastic hash partitioning:
1. Extendible Hash: This is optimized for skewed data. The algorithm begins with a set of hash buckets, one per node. When the cluster increases in size, the partitioner splits the hash buckets of the most heavily loaded hosts, redistributing part of their contents to the new nodes.
2. Consistent Hash: This is optimized for data that is evenly distributed throughout an array. It is a hash map laid out around the circumference of a circle: both nodes and chunks are hashed to an integer, which designates their position on the circle's edge.
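As an illustration of the Consistent Hash idea, here is a minimal sketch. It assumes the usual convention that a chunk is stored on the first node found clockwise from its position, which the slide does not spell out. The class, the node names and the use of MD5 as the hash function are all illustrative choices, not SciDB's implementation.

```python
import bisect
import hashlib


def ring_position(key):
    """Hash a node name or chunk id to an integer position on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16)


class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._positions = []  # sorted ring positions of the nodes
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node

    def node_for(self, chunk_id):
        """Return the first node clockwise from the chunk's position."""
        pos = ring_position(chunk_id)
        # Wrap around to the first node when the chunk is past the last one.
        idx = bisect.bisect_right(self._positions, pos) % len(self._positions)
        return self._owner[self._positions[idx]]
```

With this layout, adding a node only takes chunks away from its clockwise neighbour. Every other chunk keeps its current owner, which is what makes the reorganization incremental.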

Range Partitioning
Range partitioning has the best performance for queries with clustered data access. There are three strategies for clustered data partitioning:
- K-d Tree: An efficient strategy for range partitioning skewed, multidimensional data. The K-d Tree stores its partitioning table as a binary tree.
- Uniform Range: This partitioner is optimized for unskewed arrays. It requires a complicated global reorganization at every cluster expansion to keep its partitions balanced.
- Append: This partitioner adjusts its layout based on storage size rather than logical chunk count, and has minimal overhead for data reorganization.
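The K-d Tree partitioning table can be sketched as a small binary tree whose leaves are hosts and whose internal nodes are split planes. The sketch below assumes that a new node is added by splitting one host's region at the median chunk coordinate along a chosen dimension. The slides do not give the split rule, and the names `KdNode`, `lookup` and `split_leaf` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Chunk = Tuple[int, ...]  # chunk coordinates in array space


@dataclass
class KdNode:
    host: Optional[str] = None        # set on leaves only
    dim: int = 0                      # split dimension (internal nodes)
    split: int = 0                    # split coordinate (internal nodes)
    left: Optional["KdNode"] = None   # region with coordinate < split
    right: Optional["KdNode"] = None  # region with coordinate >= split

    def is_leaf(self) -> bool:
        return self.left is None


def lookup(tree: KdNode, chunk: Chunk) -> Optional[str]:
    """Walk the binary tree to find the host that stores `chunk`."""
    while not tree.is_leaf():
        tree = tree.left if chunk[tree.dim] < tree.split else tree.right
    return tree.host


def split_leaf(leaf: KdNode, chunks: Sequence[Chunk], dim: int, new_host: str) -> None:
    """Split a leaf at the median of its chunks along `dim`.

    `chunks` are the chunks currently stored on the leaf's host. The lower
    half stays on the old host; the upper half moves to `new_host`.
    """
    coords = sorted(c[dim] for c in chunks)
    median = coords[len(coords) // 2]
    leaf.left = KdNode(host=leaf.host)
    leaf.right = KdNode(host=new_host)
    leaf.dim, leaf.split, leaf.host = dim, median, None
```

Because the split point follows the data rather than the coordinate range, a dense region ends up divided into smaller pieces than a sparse one, which is why this layout copes with skew.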

Elastic Partitioner Results and Conclusion
- Cost of redistribution: Append is the clear winner here, as it does not rebalance the data. K-d Tree and the hash partitioners also perform well. Uniform Range globally redistributes the data and hence takes longer.
- Load balancing: Skew strongly influences the performance of our range partitioners. Append exhibits poor load balancing overall. Consistent Hash and Extendible Hash do best because they subdivide the data at its finest granularity, by chunk.
- For data loading and reorganization, the Append approach is fastest, but this speed comes at a cost when the database executes queries over imbalanced storage. However, K-d Tree is the most effective partitioner for our array workloads.

THANK YOU