CPT-S 580-06 Advanced Databases 11 Yinghui Wu EME 49

Key-value store

KV-stores and Relational Tables

KV-stores can be viewed as two-column (key, value) tables with a single key column. They can be used to implement more complicated relational tables, for example (the State column serves as the index/key):

State      | ID | Population | Area    | Senator_1
Alabama    | 1  | 4,822,023  | 52,419  | Sessions
Alaska     | 2  | 731,449    | 663,267 | Begich
Arizona    | 3  | 6,553,255  | 113,998 | Boozman
Arkansas   | 4  | 2,949,131  | 53,178  | Flake
California | 5  | 38,041,430 | 163,695 | Boxer
Colorado   | 6  | 5,187,582  | 104,094 | Bennet
…

KV-stores and Relational Tables

The KV-version of the previous table includes one table indexed by the actual key (State) and others indexed by an ID:

State → ID:      Alabama → 1, Alaska → 2, Arizona → 3, Arkansas → 4, California → 5, Colorado → 6, …
ID → Population: 1 → 4,822,023, 2 → 731,449, 3 → 6,553,255, 4 → 2,949,131, 5 → 38,041,430, 6 → 5,187,582, …
ID → Area:       1 → 52,419, 2 → 663,267, 3 → 113,998, 4 → 53,178, 5 → 163,695, 6 → 104,094, …
ID → Senator_1:  1 → Sessions, 2 → Begich, 3 → Boozman, 4 → Flake, 5 → Boxer, 6 → Bennet, …

KV-stores and Relational Tables

You can add indices with new KV-tables, e.g. an Index_2 table keyed on Senator_1:

Index (State → ID):       Alabama → 1, Alaska → 2, Arizona → 3, Arkansas → 4, California → 5, Colorado → 6, …
ID → Population:          1 → 4,822,023, 2 → 731,449, 3 → 6,553,255, 4 → 2,949,131, 5 → 38,041,430, 6 → 5,187,582, …
Index_2 (Senator_1 → ID): Sessions → 1, Begich → 2, Boozman → 3, Flake → 4, Boxer → 5, Bennet → 6, …

Thus KV-tables can be used for column-based storage, as opposed to the row-based storage typical in older DBMSs. Alternatively, the value field can contain complex data (next page).
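
To make this concrete, here is a minimal sketch (Python; the table, state, and attribute names follow the example above, and the helper function is hypothetical) of the relational table stored column-wise as per-column KV-tables plus a secondary index:

```python
# A relational table stored column-wise as several key-value tables.
state_to_id = {"Alabama": 1, "Alaska": 2, "Arizona": 3}        # primary index
population  = {1: 4_822_023, 2: 731_449, 3: 6_553_255}         # one KV-table per column
senator_1   = {1: "Sessions", 2: "Begich", 3: "Boozman"}

# A secondary index is just another KV-table mapping attribute -> id.
senator_to_id = {v: k for k, v in senator_1.items()}

def lookup_population(state):
    """Row lookup = chase the key through the per-column tables."""
    row_id = state_to_id[state]
    return population[row_id]

print(lookup_population("Alaska"))            # 731449
print(population[senator_to_id["Sessions"]])  # 4822023, via the secondary index
```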

Key-Values: Examples

Amazon:
– Key: customerID
– Value: customer profile (e.g., buying history, credit card, …)
Facebook, Twitter:
– Key: UserID
– Value: user profile (e.g., posting history, photos, friends, …)
iCloud/iTunes:
– Key: movie/song name
– Value: movie, song
Distributed file systems:
– Key: block ID
– Value: block

System Examples

Google File System, Hadoop Distributed File System (HDFS)
Amazon:
– Dynamo: internal key-value store used to power Amazon.com (e.g., the shopping cart)
– Simple Storage Service (S3)
BigTable/HBase/Hypertable: distributed, scalable data storage
Voldemort
Cassandra: "distributed data management system" (Facebook)
Memcached: in-memory key-value store for small chunks of arbitrary data (strings, objects)
eDonkey/eMule: peer-to-peer sharing systems

Key-Value Store

Also called a Distributed Hash Table (DHT).
Main idea: partition the set of key-value pairs across many machines.

(Figure: (key, value) pairs distributed across a set of machines.)

Key Operations

put(key, value): where do you store a new (key, value) tuple?
get(key): where is the value associated with a given "key" stored?

And, do the above while providing:
– Fault tolerance
– Scalability
– Consistency
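
As a minimal sketch (Python; the class and method names are hypothetical), the entire client-facing interface of such a store is just these two calls:

```python
class KeyValueStore:
    """Toy single-node key-value store; real systems partition and replicate this map."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Store (or overwrite) the value associated with key.
        self._data[key] = value

    def get(self, key):
        # Return the stored value, or None if the key is unknown.
        return self._data.get(key)

store = KeyValueStore()
store.put("K14", "V14")
print(store.get("K14"))  # V14
```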

Directory-Based Architecture

Have a node (the master/directory) maintain the mapping between keys and the machines (nodes) that store the values associated with those keys.

(Figure: the master/directory maps K5 → N2, K14 → N3, K105 → N50 across nodes N1 … N50; a put(K14, V14) arrives at the master and is forwarded to N3, which stores K14/V14.)

Directory-Based Architecture

Have a node (the master/directory) maintain the mapping between keys and the machines (nodes) that store the values associated with those keys.

(Figure: the same setup for reads: a get(K14) arrives at the master, which fetches V14 from N3 and returns it to the client.)

Directory-Based Architecture

Having the master relay the requests ⇒ recursive query.
Another method: iterative query (this slide):
– Return the node to the requester and let the requester contact the node directly.

(Figure: for put(K14, V14), the master replies "N3" and the client then sends put(K14, V14) to N3 itself.)

Directory-Based Architecture

Having the master relay the requests ⇒ recursive query.
Another method: iterative query:
– Return the node to the requester and let the requester contact the node directly.

(Figure: for get(K14), the master replies "N3"; the client contacts N3 directly and receives V14.)

Iterative vs. Recursive Query

Recursive query:
– Advantages: faster (lower latency), as the master/directory is typically closer to the storage nodes; easier to maintain consistency, as the master/directory can serialize puts()/gets().
– Disadvantage: scalability bottleneck, as all values go through the master/directory.
Iterative query:
– Advantage: more scalable.
– Disadvantages: slower (higher latency); harder to enforce data consistency.

(Figure: recursive: get(K14) goes to the master, which fetches V14 from N3 and returns it; iterative: the master returns "N3" and the client fetches V14 from N3 itself.)
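
A minimal sketch (Python; node names, directory contents, and function names are all illustrative) contrasting the two lookup styles against a directory that maps keys to nodes:

```python
# Toy directory-based KV store illustrating recursive vs. iterative gets.
nodes = {
    "N1": {},
    "N2": {"K5": "V5"},
    "N3": {"K14": "V14"},
    "N50": {"K105": "V105"},
}
directory = {"K5": "N2", "K14": "N3", "K105": "N50"}   # master/directory: key -> node

def get_recursive(key):
    """Recursive query: the master relays the request; the value flows back through it."""
    node = directory[key]
    return nodes[node].get(key)

def get_iterative(key):
    """Iterative query: the master only returns the node; the client contacts it directly."""
    return directory[key]

print(get_recursive("K14"))      # V14
node = get_iterative("K14")      # "N3"
print(nodes[node].get("K14"))    # V14, fetched by the client itself
```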

Challenges

Fault tolerance: handle machine failures without losing data and without degrading performance.
Scalability:
– Need to scale to thousands of machines.
– Need to allow easy addition of new machines.
Consistency: maintain data consistency in the face of node failures and message losses.
Heterogeneity (if deployed as a peer-to-peer system):
– Latency: 1 ms to 1000 ms
– Bandwidth: 32 Kb/s to several GB/s
– …

Fault Tolerance

Replicate each value on several nodes.
Usually, place replicas on different racks in a datacenter to guard against rack failures.

(Figure, recursive version: the master/directory now maps K14 → N1, N3; a put(K14, V14) is forwarded by the master to both N1 and N3, which each store K14/V14.)

Fault Tolerance

Again, we can have:
– Recursive replication (previous slide)
– Iterative replication (this slide)

(Figure, iterative version: the master returns the replica list "N1, N3" to the client, and the client sends put(K14, V14) to N1 and N3 itself.)
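
A minimal sketch (Python; node names, the replica map, and the function name are hypothetical) of recursive replication, where the master forwards a put to every node in the key's replica list and collects acknowledgements:

```python
# Recursive replication sketch: the master forwards a put to every replica of the key.
nodes = {"N1": {}, "N3": {}, "N50": {}}               # node name -> local storage
replica_map = {"K14": ["N1", "N3"], "K105": ["N50"]}  # master/directory: key -> replica list

def put_replicated(key, value):
    """Write to every replica in the key's list and count acknowledgements."""
    acks = 0
    for node in replica_map[key]:
        nodes[node][key] = value     # in a real system this is an RPC that may fail
        acks += 1
    return acks == len(replica_map[key])

print(put_replicated("K14", "V14"))            # True once both N1 and N3 acked
print(nodes["N1"]["K14"], nodes["N3"]["K14"])  # V14 V14
```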

Consistency

How closely does a distributed system emulate a single machine in terms of read and write semantics?

Q: Assume put(K14, V14') and put(K14, V14'') are concurrent; what value ends up being stored?
Q: Assume a client calls put(K14, V14) and then get(K14); what result does get() return?

These semantics are not trivial to achieve in a distributed system.

Concurrent Writes (Updates)

With concurrent updates (i.e., puts to the same key), we may need to make sure that the updates are applied in the same order at every replica.

(Figure: put(K14, V14') and put(K14, V14'') reach replicas N1 and N3 in different orders, so one replica ends up with V14' and the other with V14''. What does get(K14) return? Undefined!)

Concurrent Writes (Updates)

With concurrent updates (i.e., puts to the same key), we may need to make sure that the updates are applied in the same order at every replica.

(Figure: here the master forwards both puts, but they still arrive at N1 and N3 in different orders; the replicas again end up with different values, so the result of get(K14) is undefined.)
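
A minimal sketch (Python; the replica and key names mirror the figures above and are purely illustrative) of why unordered concurrent puts are a problem: two replicas apply the same two updates in different orders and end up disagreeing.

```python
# Two replicas of the same key receive the same two updates in different orders.
replica_n1, replica_n3 = {}, {}

updates = [("K14", "V14'"), ("K14", "V14''")]

for key, value in updates:              # N1 sees V14' first, then V14''
    replica_n1[key] = value
for key, value in reversed(updates):    # N3 sees them in the opposite order
    replica_n3[key] = value

print(replica_n1["K14"])   # V14''
print(replica_n3["K14"])   # V14'
# The replicas now disagree: get(K14) returns a different value depending on which
# replica answers, unless the updates are ordered (e.g., serialized by a master).
```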

Read after Write

A read is not guaranteed to return the value of the latest write.
– This can happen, for example, if the master processes requests in different threads.

(Figure: get(K14) is issued right after put(K14, V14'), but the get reaches N3 before the put does, so it returns the old value V14.)

Strong Consistency

Assume the master serializes all operations.
Challenge: the master becomes a bottleneck.
– Not addressed here.
We still want to improve the performance of reads and writes ⇒ quorum consensus.

Quorum Consensus

Improve put() and get() operation performance.

Define a replica set of size N:
– put() waits for acknowledgements from at least W replicas.
– get() waits for responses from at least R replicas.
– W + R > N

Why does it work?
– Any read quorum overlaps any write quorum, so at least one node contacted by get() holds the latest update.

Quorum Consensus Example

N = 3, W = 2, R = 2. Replica set for K14: {N1, N3, N4}.
Assume the put() on N3 fails.

(Figure: put(K14, V14) reaches N1 and N4, which store V14 and acknowledge; the put to N3 is lost.)

Quorum Consensus Example

Now, issuing a get() to any two nodes out of the three will return the answer.

(Figure: get(K14) sent to N3 and N4 returns nil from N3 but V14 from N4.)
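
A minimal sketch (Python; node names and function names are hypothetical) of this quorum example with N = 3, W = 2, R = 2 and the put to N3 lost; because W + R > N, any two replicas contacted on a read include at least one that saw the write:

```python
N, W, R = 3, 2, 2
assert W + R > N, "any read quorum must overlap any write quorum"

replicas = {"N1": {}, "N3": {}, "N4": {}}   # replica set for K14
failed_on_put = {"N3"}                      # pretend the put() to N3 is lost

def quorum_put(key, value):
    """Send the write to all replicas; succeed once at least W acknowledge."""
    acks = 0
    for name, store in replicas.items():
        if name not in failed_on_put:
            store[key] = value
            acks += 1
    return acks >= W

def quorum_get(key, contact):
    """Ask R replicas; at least one of them holds the latest value."""
    responses = [replicas[name].get(key) for name in contact[:R]]
    return next((v for v in responses if v is not None), None)

print(quorum_put("K14", "V14"))          # True: N1 and N4 acknowledged (2 >= W)
print(quorum_get("K14", ["N3", "N4"]))   # V14 (N3 returns nil, N4 has it)
print(quorum_get("K14", ["N1", "N3"]))   # V14
```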

Summary: Key-Value Stores

Very large scale storage systems.
Two operations:
– put(key, value)
– value = get(key)
Challenges:
– Fault tolerance ⇒ replication
– Scalability ⇒ serve get()s in parallel; replicate/cache hot tuples
– Consistency ⇒ quorum consensus to improve put/get performance
System case study: Dynamo

Key-value store case study: Dynamo

Amazon's platform architecture

Summary of techniques used in Dynamo and their advantages

Problem                            | Technique                                               | Advantage
Partitioning                       | Consistent hashing                                      | Incremental scalability
High availability for writes       | Vector clocks with reconciliation during reads          | Version size is decoupled from update rates
Handling temporary failures        | Sloppy quorum and hinted handoff                        | Provides high availability and durability when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees                         | Synchronizes divergent replicas in the background
Membership and failure detection   | Gossip-based membership protocol and failure detection  | Preserves symmetry and avoids a centralized registry for membership and node liveness information

System Assumptions and Requirements

Simple read and write operations to a data item that is uniquely identified by a key; most of Amazon's services can work with this simple query model and do not need a relational schema.
Targeted applications store objects that are relatively small (usually less than 1 MB).
Dynamo targets applications that can operate with weaker consistency (the "C" in ACID) if this results in higher availability.
No guarantee on isolation.
There are no security-related requirements such as authentication and authorization.

System Architecture

Partitioning
High availability for writes
Handling temporary failures
Recovering from permanent failures
Membership and failure detection

Partition Algorithm

Consistent hashing: the output range of a hash function is treated as a fixed circular space or "ring".
"Virtual nodes": each physical node can be responsible for more than one virtual node (i.e., more than one position on the ring).
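
A minimal sketch (Python; the class, function, and node names are hypothetical) of consistent hashing with virtual nodes: each physical node owns several positions on the ring, and a key is assigned to the first virtual node clockwise from its hash:

```python
import bisect
import hashlib

def ring_hash(s):
    """Map a string to a position on the ring (a 32-bit integer space here)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=3):
        # Each physical node gets several virtual positions on the ring.
        self._ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._positions = [pos for pos, _ in self._ring]

    def coordinator(self, key):
        """The first virtual node clockwise from hash(key) is responsible for the key."""
        idx = bisect.bisect_right(self._positions, ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-A", "node-B", "node-C"])
print(ring.coordinator("K14"))   # one of node-A/B/C, depending on the hash positions
```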

Replication

Each data item is replicated at N hosts.
"Preference list": the list of nodes that are responsible for storing a particular key.
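
A minimal sketch (Python; the ring positions and node names are made up for illustration) of building a preference list: walk the ring clockwise from the key's position and collect the first N distinct physical nodes, skipping extra virtual nodes of the same machine:

```python
# Preference list sketch: walk the ring clockwise and collect N distinct physical nodes.
# Ring positions are illustrative; in practice they come from hashing virtual nodes.
ring = [(100, "A"), (220, "B"), (350, "A"), (480, "C"), (610, "B"), (730, "C")]

def preference_list(key_pos, n_replicas=2):
    ordered = sorted(ring)
    # Start at the first virtual node clockwise from the key's position.
    start = next((i for i, (pos, _) in enumerate(ordered) if pos >= key_pos), 0)
    prefs = []
    for i in range(len(ordered)):
        node = ordered[(start + i) % len(ordered)][1]
        if node not in prefs:
            prefs.append(node)          # skip duplicate physical nodes (virtual nodes)
        if len(prefs) == n_replicas:
            break
    return prefs

print(preference_list(400, n_replicas=2))   # ['C', 'B']
```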

Vector Clocks

Used for conflict detection of data; timestamp-based resolution of conflicts is not enough.

(Figure: a timeline of updates and replication steps for an object at times 1–5, showing where conflict detection is needed.)

Vector Clock

A vector clock is a list of (node, counter) pairs.
Every version of every object is associated with one vector clock.
If the counters in the first object's clock are less than or equal to the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten.
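
A minimal sketch (Python; the function names and the Sx/Sy/Sz node labels are hypothetical) of vector clocks as {node: counter} maps with the ancestor test described above; two clocks where neither dominates the other indicate a conflict that must be reconciled:

```python
def descends(a, b):
    """True if clock a is an ancestor of (or equal to) clock b."""
    return all(b.get(node, 0) >= counter for node, counter in a.items())

def compare(a, b):
    if descends(a, b):
        return "a is an ancestor of b (a can be forgotten)"
    if descends(b, a):
        return "b is an ancestor of a (b can be forgotten)"
    return "concurrent versions: conflict, reconcile on read"

v1 = {"Sx": 1}                 # written at node Sx
v2 = {"Sx": 2}                 # v1 updated again at Sx
v3 = {"Sx": 2, "Sy": 1}        # v2 updated at Sy
v4 = {"Sx": 2, "Sz": 1}        # v2 updated concurrently at Sz

print(compare(v1, v2))   # a is an ancestor of b
print(compare(v3, v4))   # concurrent versions: conflict
```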

Summary of techniques used in Dynamo and their advantages

Problem                            | Technique                                               | Advantage
Partitioning                       | Consistent hashing                                      | Incremental scalability
High availability for writes       | Vector clocks with reconciliation during reads          | Version size is decoupled from update rates
Handling temporary failures        | Sloppy quorum and hinted handoff                        | Provides high availability and durability when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees                         | Synchronizes divergent replicas in the background
Membership and failure detection   | Gossip-based membership protocol and failure detection  | Preserves symmetry and avoids a centralized registry for membership and node liveness information

Scalability

Storage: use more nodes.
Request throughput:
– Can serve requests for a value from all nodes on which it is stored, in parallel.
– Large values can be broken into blocks (HDFS files are broken up this way).
– The master can replicate a popular value on more nodes.
Master/directory scalability:
– Replicate it.
– Partition it, so that different keys are served by different masters/directories.

Scalability: Load Balancing

The directory keeps track of the storage available at each node.
– Preferentially insert new values on nodes with more storage available.
What happens when a new node is added?
– We cannot insert only new values on the new node. Why?
– Move values from heavily loaded nodes to the new node.
What happens when a node fails?
– Need to re-replicate the values from the failed node onto other nodes.
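
A minimal sketch (Python; node names, capacities, and function names are made up for illustration) of this load-balancing policy: new values go to the node with the most free storage, and when a node is added, some existing keys are moved onto it:

```python
free_space = {"N1": 10, "N2": 40, "N3": 25}     # free storage per node (illustrative units)
placement = {}                                   # key -> node

def place(key, size=1):
    """Preferentially insert new values on the node with the most free storage."""
    node = max(free_space, key=free_space.get)
    placement[key] = node
    free_space[node] -= size
    return node

def add_node(new_node, capacity, keys_to_move):
    """When a node is added, migrate some keys from heavily loaded nodes to it."""
    free_space[new_node] = capacity
    for key in keys_to_move:
        free_space[placement[key]] += 1          # space freed on the old node
        placement[key] = new_node
        free_space[new_node] -= 1

print(place("K14"))        # N2 (most free space)
add_node("N4", 50, ["K14"])
print(placement["K14"])    # N4
```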

Replication Challenges

Need to make sure that a value is replicated correctly.
How do you know a value has been replicated on every node?
– Wait for acknowledgements from every node.
What happens if a node fails during replication?
– Pick another node and try again.
What happens if a node is slow?
– Slow down the entire put()? Or pick another node.
In general, with multiple replicas: slow puts and fast gets.

Consistency (cont'd)

There is a large variety of consistency models:
– Atomic consistency (linearizability): reads/writes (gets/puts) to replicas appear as if there were a single underlying replica (single system image). Think "one update at a time".
– Eventual consistency: given enough time, all updates will propagate through the system. One of the weakest forms of consistency; used by many systems in practice.
– And many others: causal consistency, sequential consistency, strong consistency, …