VICTORIA UNIVERSITY OF WELLINGTON (Te Whare Wananga o te Upoko o te Ika a Maui)
SWEN 432 Advanced Database Design and Implementation
Partitioning and Replication
Lecturer: Dr. Pavle Mogin

Plan for Data Partitioning and Replication
Data partitioning and replication techniques
Consistent Hashing
– The basic principles
– Workload balancing
Replication
Membership changes
– Joining a system
– Leaving a system
Readings: have a look at the Readings at the Course Home Page

Data Partitioning and Replication
There are three reasons for storing a database on a number of machines (nodes):
– The amount of data exceeds the capacity of a single machine,
– To allow scaling for load balancing, and
– To ensure reliability and availability by replication
There are a number of techniques for achieving data partitioning and replication:
– Memory caches,
– Separating reads from writes,
– Clustering, and
– Sharding

Caches and Separating Reads from Writes
Memory caches can be seen as transient, partly partitioned and replicated in-memory databases, since they replicate the most frequently requested parts of a database into the main memory of a number of servers
– Fast response to clients
– Off-loading of database servers
Separating reads from writes:
– One or more servers are dedicated to writes (the master(s)),
– A number of replica servers are dedicated to serving reads (the slaves),
– The master replicates its updates to the slaves
If the master crashes before completing replication to at least one slave, the write operation is lost
The most up-to-date slave then takes over the role of the master
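A minimal sketch of this read/write split, assuming a hypothetical MasterSlaveRouter class rather than any particular product's API; writes go to the single master copy, which pushes them to the slaves, and reads are spread over the slaves:

```python
import random

class MasterSlaveRouter:
    """Hypothetical sketch: one master accepts writes, slaves serve reads."""

    def __init__(self, num_slaves=2):
        self.master = {}                      # the only copy that accepts writes
        self.slaves = [{} for _ in range(num_slaves)]

    def write(self, key, value):
        self.master[key] = value
        for slave in self.slaves:             # the master replicates updates to the slaves
            slave[key] = value

    def read(self, key):
        return random.choice(self.slaves).get(key)   # off-load the master

router = MasterSlaveRouter()
router.write("user:42", {"name": "Ada"})
print(router.read("user:42"))
```

The sketch replicates synchronously for simplicity; with asynchronous replication, a master crash before at least one slave has received an update loses that write, as noted on the slide.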

Clustering
High-availability clusters (also known as HA clusters or failover clusters) are groups of computers that support server applications
They operate by harnessing redundant computers in groups to provide continued service when system components fail
When an HA cluster detects a hardware or software fault, it immediately restarts the application on another system without requiring administrative intervention
– A process known as failover
HA clusters use redundancy to eliminate single points of failure (SPOF)

Sharding
Data sharding is data partitioning in such a way that:
– Data typically requested and updated together reside on the same node, and
– The workload and storage volume are roughly evenly distributed among servers
Data sharding is not an appropriate technique for partitioning relational databases, due to their normalized structure
Data sharding is very popular with cloud non-relational databases, since they are designed with partitioning in mind
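A small sketch of shard routing, using a hypothetical shard_for function and the user identifier as the shard key, so that everything belonging to one user lands on the same node:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(user_id: str) -> int:
    # Deterministic hash of the shard key: all data of one user maps to one shard
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for user, order_id in [("alice", 1), ("bob", 2), ("alice", 3)]:
    print(f"order {order_id} of {user} -> shard {shard_for(user)}")
```

Routing by hash mod the number of shards breaks down as soon as that number changes, which is exactly what motivates consistent hashing on the next slide.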

Consistent Hashing
Consistent hashing is a data partitioning technique
An obvious (but naive) way to map a database object o to a partition p on a network node is to hash the object's primary key k over the set of n available nodes:
p = hash(k) mod n
In a setting where nodes may join and leave at runtime, this simple approach is not appropriate, since most keys have to be remapped and their objects moved to another node
Consistent hashing is a special kind of hashing where only about K / n keys need to be remapped, where K is the number of keys
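A quick sketch (hypothetical key set and node counts) of why the naive placement is fragile: when a fifth node joins a four-node system, most keys change partitions under hash-mod-n, whereas consistent hashing would move only about K / n of them:

```python
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"object-{i}" for i in range(10_000)]

# Naive placement p = hash(k) mod n, before and after a fifth node joins
before = {k: h(k) % 4 for k in keys}
after = {k: h(k) % 5 for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of the keys move under mod-n hashing")
# Roughly 80% move here; consistent hashing would move only about K / n = 20%
```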

Consistent Hashing (The Main Idea)
The main idea behind consistent hashing is to associate each node with one or more hash value intervals, where the interval boundaries are determined by calculating the hash of each node identifier
If a node is removed, its interval is taken over by a node with an adjacent interval; all the other nodes are unchanged
The hash function does not depend on the number of nodes n
Consistent hashing is used in the partitioning component of a number of CDBMSs

Consistent Hashing (Basic Principles 1)
Each database object is mapped to a point on the edge of a circle by hashing its key value
Each available machine is also mapped to a point on the edge of the same circle
To find the node that stores an object, the CDBMS:
– Hashes the object's key to a point on the edge of the circle,
– Walks clockwise around the circle until it encounters the first node point
Each node stores the objects that map between its point and the previous node's point
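A minimal sketch of this lookup rule, assuming MD5 as the ring hash and a hypothetical ConsistentHashRing class (real CDBMSs differ in the details): hash the key to a point, then take the first node point found clockwise from it.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string (object key or node identifier) to a point on the circle."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self._nodes = sorted(nodes, key=ring_hash)        # node points in clockwise order
        self._points = [ring_hash(n) for n in self._nodes]

    def node_for(self, key: str) -> str:
        point = ring_hash(key)
        i = bisect.bisect_right(self._points, point)      # walk clockwise to the next node point
        return self._nodes[i % len(self._nodes)]          # wrap around past the largest point

ring = ConsistentHashRing(["A", "B", "C"])
for key in ("o1", "o2", "o3", "o4"):
    print(key, "->", ring.node_for(key))   # placement depends on the hash values, not on the figure
```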

Consistent Hashing (Example 1)
[Figure: a ring with nodes A, B, C and objects o1, o2, o3, o4 placed on its edge]
Objects o1 and o4 are stored on node A
Object o2 is stored on node B
Object o3 is stored on node C

Consistent Hashing (Basic Principles 2)
If a node leaves the network, the node in the clockwise direction stores all new objects that would have belonged to the failed node
The existing data objects of the failed node have to be redistributed to the remaining nodes
If a node is added to the network, it is mapped to a point on the circle
– All objects that map between the point of the new node and its first counter-clockwise neighbour now map to the new node
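A self-contained sketch (hypothetical keys and node names) of this property: when node D is added to the ring, the only keys that change owner are those in the interval D takes over, and all of them land on D.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def node_for(key, nodes):
    points = sorted(ring_hash(n) for n in nodes)
    owners = sorted(nodes, key=ring_hash)
    i = bisect.bisect_right(points, ring_hash(key))
    return owners[i % len(owners)]

keys = [f"object-{i}" for i in range(10_000)]
before = {k: node_for(k, ["A", "B", "C"]) for k in keys}
after = {k: node_for(k, ["A", "B", "C", "D"]) for k in keys}   # node D joins

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved,",
      "all of them to D:", all(after[k] == "D" for k in moved))
```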

Consistent Hashing (Example 2)
Node C has left and node D has entered the network
[Figure: the ring with nodes A, B, D and objects o1, o2, o3, o4]
Object o1 is stored on node A
Object o2 is stored on node B
Objects o3 and o4 are stored on node D

Consistent Hashing (Problems)
The basic consistent hashing algorithm suffers from a number of problems:
1. Unbalanced distribution of objects to nodes, due to the different sizes of the intervals belonging to the nodes
– This is a consequence of determining the position of a node as a random number, by applying a hash function to its identifier
2. If a node has left the network, the objects stored on that node become unavailable
3. If a node joins the network, the adjacent node still stores objects that now belong to the new node
– Client applications ask the new node for these objects, not the old one (which actually stores them)

Consistent Hashing (Solution 1)
An approach to solving the unbalanced distribution of database objects is to define a number of virtual nodes for each physical node:
– The number of virtual nodes depends on the performance of the physical node (CPU speed, memory, and disk capacity)
– The identifier of a virtual node is produced by appending the virtual node's ordinal number to the physical node's identifier
– A point on the edge of the circle is assigned to each virtual node
– This way, database objects hashed to different parts of the circle can belong to the same physical node
– Experiments show very good balancing after defining a few hundred virtual nodes for each physical node
By introducing k virtual nodes, each physical node is given k random addresses on the edge of the circle
– Virtual node addresses are called tokens
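A sketch of the balancing effect (hypothetical node names; token identifiers are built by appending an ordinal to the physical node's identifier, as described above): with a single token per node the loads are skewed, while a hundred tokens per node even them out.

```python
import bisect
import hashlib
from collections import Counter

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(physical_nodes, k):
    """Give each physical node k tokens named '<node>-<ordinal>'."""
    return sorted((ring_hash(f"{node}-{i}"), node)
                  for node in physical_nodes for i in range(k))

keys = [f"object-{i}" for i in range(30_000)]
for k in (1, 100):
    ring = build_ring(["A", "B", "C"], k)
    points = [p for p, _ in ring]
    owners = [n for _, n in ring]
    load = Counter(owners[bisect.bisect_right(points, ring_hash(key)) % len(ring)]
                   for key in keys)
    print(f"{k:>3} token(s) per node -> {dict(load)}")
```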

Balancing Workload (Example)
Let physical node A have k virtual nodes. Then Ai, for i = 0, 1, ..., k-1, is the identifier of virtual node i of physical node A
[Figure: a ring with virtual nodes A0, B0, A1, C0, A2, B1, B2, C1]
Physical nodes A and B have three virtual nodes each
Physical node C has only two virtual nodes

Consistent Hashing (Solutions 2 & 3)
The problems caused by an existing node leaving and a new node entering the network are solved by introducing a replication factor n (> 1)
– This way, the same database object is stored on n consecutive physical nodes: the object's home node and the n - 1 nodes that follow it in the clockwise direction
Now, if a physical node leaves the network, its data objects still remain stored on the n - 1 nodes that follow it on the ring, and they will be found by the basic search algorithm already described
If a new node enters the network, some of its data objects will be found by accessing its first clockwise neighbour (the new search algorithm)
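A sketch of building this replica list (sometimes called a preference list; hypothetical helper, MD5 ring as in the earlier sketches): walk clockwise from the object's point and collect the first n distinct physical nodes, skipping further tokens of machines already in the list.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def preference_list(key, ring, n):
    """First n distinct physical nodes clockwise from the key's point.

    `ring` is a sorted list of (token, physical_node) pairs; the first entry
    of the result is the object's home node, the rest hold its replicas."""
    points = [p for p, _ in ring]
    start = bisect.bisect_right(points, ring_hash(key))
    nodes = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in nodes:                 # skip extra tokens of the same machine
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes

ring = sorted((ring_hash(f"{node}-{i}"), node)
              for node in ("A", "B", "C", "D") for i in range(8))
print(preference_list("o1", ring, n=3))       # three distinct physical nodes
```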

Replication (Example)
Assume replication factor n = 3
[Figure: the ring of virtual nodes A0, B0, A1, C0, A2, B1, B2, C1 with preference lists o1 [A, B, C], o2 [B, A, C], o3 [C, A, B]]
Object o1 will be stored on physical nodes A, B, and C
Object o2 will be stored on physical nodes B, A, and C
Object o3 will be stored on physical nodes C, A, and B
If node A leaves, object o1 will still be accessible on nodes B and C
If a new node D enters the network, some of node A's former objects will still be accessible on node A via node D

Membership Changes
The process of nodes leaving and joining the network is referred to as membership changes
The following slides consider principles of membership changes that may or may not apply to each Cloud DBMS using gossiping
When a new node joins the system:
1. The new node announces its presence and its identifier to the adjacent nodes (or to all nodes) via broadcast
2. The neighbours react by adjusting their object and replica ownerships
3. The new node receives copies of the datasets it is now responsible for from its neighbours

Node X Joins the System (Example)
[Figure: a ring of physical nodes A, X, B, C, D, E, F, G, H, clockwise; only physical nodes shown; replication factor n = 3; X has joined between A and B]
– X tells H, A, B, C, and D that it is joining
– Range AB is split into Range AX and Range XB
– Range GH is copied to X, and node B drops Range GH
– Range HA is copied to X, and node C drops Range HA
– Range AX is copied to X, and node D drops Range AX
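A small sketch of the bookkeeping behind this example (hypothetical helper; node positions are taken from the slide's ring order rather than computed from hashes): with replication factor n, the joining node must receive the ranges ending at itself and at its n - 1 counter-clockwise predecessors.

```python
def ranges_to_copy(ring, new_node, n=3):
    """Ranges a joining node must receive, given the clockwise node order
    after the join and the replication factor n."""
    i = ring.index(new_node)
    ranges = []
    for back in range(n):
        end = ring[(i - back) % len(ring)]
        start = ring[(i - back - 1) % len(ring)]
        ranges.append(f"Range {start}{end}")
    return ranges

# X joins between A and B on the slide's ring of nodes A..H
ring_after_join = ["A", "X", "B", "C", "D", "E", "F", "G", "H"]
print(ranges_to_copy(ring_after_join, "X"))   # ['Range AX', 'Range HA', 'Range GH']
```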

Membership Changes (Leaving)
If a node departs from the network (for any reason):
1. The other nodes have to be able to detect its departure,
2. When the departure has been detected, the neighbours have to exchange data with each other and adjust their object and replica ownerships
It is common that no notification is given when a node departs, whatever the reason (a crash, maintenance, or a decrease in the workload)
Nodes within a system communicate regularly, and if a node is not responding, it is considered to have departed
The remaining nodes redistribute the data of the departed node from replicas and combine the ranges of the departed node and its clockwise neighbour
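A minimal failure-detection sketch (a hypothetical FailureDetector class, not the gossip protocol of any particular system): a peer that has not been heard from within a timeout is treated as departed, after which its ranges would be redistributed as described above.

```python
import time

class FailureDetector:
    """Hypothetical sketch: mark a peer as departed when nothing has been
    heard from it for longer than `timeout` seconds."""

    def __init__(self, peers, timeout=10.0):
        now = time.monotonic()
        self.timeout = timeout
        self.last_heard = {peer: now for peer in peers}

    def heartbeat(self, peer):
        self.last_heard[peer] = time.monotonic()   # the peer contacted or gossiped with us

    def departed_peers(self):
        now = time.monotonic()
        return [p for p, t in self.last_heard.items() if now - t > self.timeout]

fd = FailureDetector(["A", "B", "C"], timeout=0.1)
time.sleep(0.2)             # B and C stay silent past the timeout
fd.heartbeat("A")
print(fd.departed_peers())  # ['B', 'C'] -> their data gets redistributed from replicas
```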

Node B Departs the System (Example)
[Figure: a ring of physical nodes A, B, C, D, E, F, G, H, clockwise; only physical nodes shown; replication factor n = 3; B has departed]
– Make Range AC from Range AB and Range BC
– Copy Range GH to C
– Copy Range HA to D
– Copy Range AB to E

Consistency and Availability Trade-offs (1)
[Figure: a ring of eight physical nodes A to H]
Replication factor 3, strong consistency required
How many nodes in total can be down (not available)?
– Worst case: 1 node
– Best case: 2 nodes

Consistency and Availability Trade-offs (2)
[Figure: a ring of eight physical nodes A to H]
Replication factor 3, eventual consistency required
How many nodes can be down (not available)?
– Worst case: 2 nodes
– Best case: 5 nodes
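A small simulation that reproduces the numbers on these two slides under one plausible reading the slides do not spell out: strong consistency is taken to mean a majority quorum (2 of the 3 replicas reachable) and eventual consistency to mean that a single reachable replica suffices; the eight nodes and the placement of replicas on consecutive nodes follow the consistent-hashing scheme above.

```python
from itertools import combinations

NODES = ["A", "B", "C", "D", "E", "F", "G", "H"]   # the eight-node ring on the slides
N = 3                                              # replication factor

def replica_sets():
    """Each key range lives on its home node and the next N-1 clockwise nodes."""
    for i in range(len(NODES)):
        yield {NODES[(i + j) % len(NODES)] for j in range(N)}

def available(down, needed):
    """True if every range still has at least `needed` reachable replicas."""
    return all(len(replicas - down) >= needed for replicas in replica_sets())

for label, needed in [("strong consistency (assumed quorum of 2 of 3)", 2),
                      ("eventual consistency (any 1 replica)", 1)]:
    ok_sizes = [k for k in range(len(NODES) + 1)
                if any(available(set(c), needed) for c in combinations(NODES, k))]
    best = max(ok_sizes)                 # some choice of this many down nodes is survivable
    worst = max(k for k in range(best + 1)
                if all(available(set(c), needed) for c in combinations(NODES, k)))
    print(f"{label}: worst case {worst}, best case {best} node(s) down")
```

Running it prints worst case 1 / best case 2 for the strong-consistency reading and worst case 2 / best case 5 for the eventual-consistency reading, matching the two slides.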

Summary (1)
Techniques for achieving data partitioning and replication are:
– Memory caches,
– Separating reads from writes,
– Clustering, and
– Sharding
The main idea of consistent hashing is to associate each physical node with one or more hash value intervals, where the hash values of node identifiers represent the interval boundaries
– Introducing virtual nodes solves the problem of unbalanced workload
– Introducing replication solves the problems caused by nodes leaving and joining the network

Summary (2)
The process of nodes leaving and joining the network is referred to as membership changes
– When a node leaves the network, the other nodes combine its range with the range of its clockwise neighbour and redistribute its data
– If a node joins the network, the neighbours react by adjusting their object and replica ownerships, and the new node receives copies of the datasets it is now responsible for