
1 A brief history of key-value stores
Landon Cox April 9, 2018

2 In the year 2000 … Portals were thought to be a good idea
Yahoo!, Lycos, AltaVista, etc. Original content up front + searchable directory The dot-com bubble was about to burst Started to break around 1999 Lots of companies washed out by 2001 Google was really taking off Founded in 1998 PageRank was best-in-class for search Proved: great search is enough (and portals are dumb) Off in the distance: Web 2.0, Facebook, AWS, “the cloud”

3 Questions of the day How do we build highly-available web services?
Support millions of users Want high-throughput How do we build highly-available peer-to-peer services? Napster had just about been shut down (centralized) BitTorrent was around the corner Want to scale to thousands of nodes No centralized trusted administration or authority Problem: everything can fall apart (and does) Some of the solutions to #2 can help with #1

4 Storage interfaces
What is the interface to a file system? Programs see a file hierarchy over physical storage and operate on it with mkdir, create, open, read, and write. What is the interface to a DBMS? Programs see a logical schema (relations with attributes Attr1…AttrN holding values Val1…ValN) over physical storage and operate on it with SQL queries.
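To make the contrast concrete, here is a minimal Python sketch (my illustration, not from the slides; the file path, table, and data are made up): the file system hands the program a raw byte stream whose internal format is the program's own business, while the DBMS exposes a logical schema queried through SQL.

```python
import sqlite3

# File-system interface: a byte array reached through open/read/write;
# the record format inside the file is entirely up to the program.
with open("/tmp/users.dat", "wb") as f:        # hypothetical file
    f.write(b"alice,30\nbob,25\n")

# DBMS interface: a logical schema operated on through SQL queries.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, age INTEGER)")   # illustrative schema
db.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 30), ("bob", 25)])
print(db.execute("SELECT name FROM users WHERE age > 28").fetchall())   # [('alice',)]
```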

5 Data independence
Idea that storage issues should be hidden from programs Programs should operate on data independently of underlying details In what way do FSes and DBs provide data independence? Both hide the physical layout of data Can change layout without altering how programs operate on data In what way do DBs provide stronger data independence? File systems leave the format of data within files up to programs One program can alter/corrupt a file format that other programs depend on Database clients cannot corrupt the schema definition

6 ACID properties Databases also ensure ACID What is meant by Atomicity?
Sequences of operations are submitted via transactions All operations in transaction succeed or fail No partial success (or failure) What is meant by Consistency? After transaction commits DB is in “consistent” state Consistency is defined by data invariants i.e., after transaction completes all invariants are true What is the downside of ensuring Consistency? In tension with concurrency and scalability Particularly in distributed settings
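A small illustration of atomicity and the consistency invariant (my sketch, not the lecture's, using Python's sqlite3 with a made-up accounts table): a transaction that fails partway leaves no partial effects behind.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (owner TEXT PRIMARY KEY, balance INTEGER)")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])

try:
    with db:  # transaction: commits if the block succeeds, rolls back if it raises
        db.execute("UPDATE accounts SET balance = balance - 50 WHERE owner = 'alice'")
        raise RuntimeError("simulated crash before the matching credit to bob")
except RuntimeError:
    pass

# The debit was rolled back: no partial success, and the invariant still holds.
print(db.execute("SELECT SUM(balance) FROM accounts").fetchone())  # (100,)
```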

7 ACID properties Databases also ensure ACID What is meant by Isolation?
Other processes cannot view the modifications of in-flight transactions Similar to atomicity Effects of a transaction cannot be partially viewed What is meant by Durability? After a transaction commits, its data will not be lost Committed transactions survive hardware and software failures

8 ACID properties Databases also ensure ACID
Do file systems ensure ACID properties? No Atomicity: operations can be buffered, re-ordered, flushed asynchronously Consistency: many different consistency models Isolation: hard to ensure isolation without a notion of transactions Durability: the need to cache undermines guarantees (can use sync) What do file systems offer instead of ACID? Faster performance Greater flexibility for programs Byte-array abstraction rather than table abstraction

9 Needs of cluster-based storage
Want three things Scalability (incremental addition of machines) Availability (tolerate failure/loss of machines) Consistency (sensible answers to requests) Traditional DBs fail to provide these features Their focus on strong consistency can hinder scalability and availability Requires a lot of coordination and complexity For file systems, it depends Some offer strong consistency guarantees (poor scalability) Some offer good scalability (poor consistency)

10 Distributed data structures (DDS)
Paper from OSDI ‘00 by Steve Gribble, Eric Brewer, Joseph Hellerstein, and David Culler Pointed out the inadequacies of traditional storage for large-scale services Proposed a new storage interface More structured than file systems (structure is provided by DDS) Not as fussy as databases (no SQL) A few operations on data structure elements

11 Distributed data structures (DDS)
Present a new storage interface More structured than file systems (structure is provided by DDS) Not as fussy as databases (no SQL …) A few operations on data structure elements Diagram: Process 1 and Process 2 issue Get/Put requests on key-value pairs (Key1, Val1) … (KeyN, ValN), which the DDS spreads across storage bricks.

12 Distributed Hash Tables (DHTs)
DHT: same idea as DDS but decentralized Same interface as a traditional hash table put(key, value) — stores value under key get(key) — returns all the values stored under key Built over a distributed overlay network Partition key space over available nodes Route each put/get request to appropriate node Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab
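The interface really is that small. A minimal Python sketch of it (a single in-process dict stands in for the whole overlay; LocalDHT is an illustrative name, not OpenDHT's API):

```python
from collections import defaultdict

class LocalDHT:
    def __init__(self):
        self._store = defaultdict(list)

    def put(self, key, value):
        """Store value under key; a key may accumulate several values."""
        self._store[key].append(value)

    def get(self, key):
        """Return all the values stored under key."""
        return list(self._store[key])

dht = LocalDHT()
dht.put("k1", "v1")
dht.put("k1", "v2")
print(dht.get("k1"))  # ['v1', 'v2']
```

In a real DHT the same two calls are served by whichever nodes the key maps to, which is what the next slides build up.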

13 How DHTs Work
How do we ensure the put and the get find the same machine? How does this work in DNS? Diagram: put(k1,v1) enters the overlay at one node and is routed to the node responsible for k1; a later get(k1) from another node is routed to that same node, which returns v1. Sean C. Rhea OpenDHT: A Public DHT Service

14 Nodes form a logical ring
Diagram: nodes with IDs 000, 010, 100, and 110 placed around a logical ring. First question: how do new nodes figure out where they should go on the ring? Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab

15 Step 1: Partition Key Space
Each node in DHT will store some (k,v) pairs Given a key space K, e.g. [0, 2^160): Choose an identifier for each node, id_i ∈ K, uniformly at random A pair (k,v) is stored at the node whose identifier is closest to k Key technique: cryptographic hashing Node id = SHA1(MAC address) P(SHA-1 collision) <<< P(hardware failure) Nodes can independently compute their id Contrast this to DDS, in which an admin manually assigned nodes to partitions. Sean C. Rhea OpenDHT: A Public DHT Service
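A minimal sketch of this partitioning rule, assuming SHA-1 node IDs and distance measured around the 2^160 ring (the MAC addresses are invented):

```python
import hashlib

RING = 2 ** 160

def sha1_id(data: bytes) -> int:
    """160-bit identifier derived from arbitrary bytes."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

def ring_distance(a: int, b: int) -> int:
    """Distance between two IDs going the shorter way around the ring."""
    d = abs(a - b)
    return min(d, RING - d)

# Each node derives its ID independently, e.g. from its MAC address (made-up values).
node_ids = [sha1_id(mac.encode())
            for mac in ("00:1a:2b:3c:4d:5e", "00:1a:2b:3c:4d:5f", "aa:bb:cc:dd:ee:ff")]

def responsible_node(key: bytes) -> int:
    """The node whose identifier is closest to the key's hash stores the pair."""
    k = sha1_id(key)
    return min(node_ids, key=lambda nid: ring_distance(nid, k))

print(hex(responsible_node(b"some-key")))
```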

16 Step 2: Build Overlay Network
Each node has two sets of neighbors Immediate neighbors in the key space Important for correctness Long-hop neighbors Allow puts/gets in O(log n) hops Sean C. Rhea OpenDHT: A Public DHT Service

17 Step 3: Route Puts/Gets Thru Overlay
Route greedily, always making progress Diagram: a get(k) request hops around the ring, each hop landing closer to the key k. Sean C. Rhea OpenDHT: A Public DHT Service
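A toy sketch of greedy routing on a tiny 8-bit ring (node IDs and neighbor links are invented for readability; a real overlay would build them as in the previous slide):

```python
RING = 2 ** 8   # tiny ring so the example stays readable

def dist(a, b):
    d = abs(a - b)
    return min(d, RING - d)

# node id -> its overlay neighbors (immediate ring neighbors plus a few long hops)
neighbors = {
    0:   [32, 224, 64, 128],
    32:  [0, 64, 96, 160],
    64:  [32, 96, 128, 192],
    96:  [64, 128, 160, 224],
    128: [96, 160, 192, 0],
    160: [128, 192, 224, 32],
    192: [160, 224, 0, 64],
    224: [192, 0, 32, 96],
}

def route(start, key):
    """Greedily forward toward key, always making progress; stop at the closest node."""
    path, current = [start], start
    while True:
        best = min(neighbors[current], key=lambda n: dist(n, key))
        if dist(best, key) >= dist(current, key):
            return path
        current = best
        path.append(current)

print(route(0, 150))   # [0, 128, 160]
```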

18 How Does Lookup Work? Assign IDs to nodes
Assign IDs to nodes Map hash values to node with closest ID Leaf set is successors and predecessors (correctness) Routing table matches successively longer prefixes (efficiency) Diagram: a lookup leaves the source, hops through nodes whose IDs (00…, 10…, 110…, 111…) match ever-longer prefixes of the target, and the response returns directly to the source. Speaker notes: Each node is assigned an ID randomly. IDs are arranged into a ring by numerical order; the top ID wraps around to 0. Each node maintains two sets of neighbors, its leaf set and routing table. The leaf set is the immediate predecessor and successor. Routing table neighbors resolve successively longer matching prefixes of the node's own ID. Lookup queries are routed greedily to the node with the closest matching ID (numerically, not lexicographically). The response is sent directly to the querying node. Lookup is handled differently in other DHTs; see Gummadi et al.'s SIGCOMM paper for details. Sean C. Rhea OpenDHT: A Public DHT Service

19 Iterative vs. recursive
Previous example: recursive lookup Could also perform lookup iteratively: Which one is faster? Diagram: the recursive and iterative lookup paths side by side. Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab

20 Iterative vs. recursive
Previous example: recursive lookup Could also perform lookup iteratively: Why might I want to do this iteratively? Diagram: the recursive and iterative lookup paths side by side. Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab

21 Iterative vs. recursive
Previous example: recursive lookup Could also perform lookup iteratively: What does DNS do and why? Diagram: the recursive and iterative lookup paths side by side. Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab
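A small sketch of the two control flows (toy code; closest_known stands in for a node's routing decision and simply steps one node along the ring toward the key's owner):

```python
nodes = [0, 32, 64, 96, 128, 160, 192, 224]   # illustrative ring

def closest_known(node, key):
    """Toy routing decision: point one hop along the sorted ring toward the key's owner."""
    owner = min(nodes, key=lambda n: abs(n - key))
    if node == owner:
        return node
    i = nodes.index(node)
    return nodes[i + 1] if owner > node else nodes[i - 1]

def recursive_lookup(node, key):
    """Recursive: each hop forwards the query onward; only the final node answers."""
    nxt = closest_known(node, key)
    return node if nxt == node else recursive_lookup(nxt, key)

def iterative_lookup(start, key):
    """Iterative: the querier asks each hop who is closer and contacts that node itself."""
    current = start
    while closest_known(current, key) != current:
        current = closest_known(current, key)
    return current

print(recursive_lookup(0, 150), iterative_lookup(0, 150))   # 160 160
```

The routing decisions are identical; what differs is who drives each hop. Recursive forwarding tends to give lower latency (slide 26 lists recursive routing among the latency tricks), while the iterative form keeps control and failure handling at the querier, roughly the way a DNS resolver iterates over authoritative servers.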

22 Fixing the Embarrassing Slowness of OpenDHT on PlanetLab
(LPC: from Pastry paper) Example routing state Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab

23 Fixing the Embarrassing Slowness of OpenDHT on PlanetLab
OpenDHT Partitioning Assign each node an identifier from the key space Store a key-value pair (k,v) on several nodes with IDs closest to k Call them replicas for (k,v) Diagram: the node with id = 0xC9A1… is responsible for the keys nearest its identifier. Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab
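A minimal sketch of that placement rule (node IDs are invented, and closeness is measured by plain numeric difference here; OpenDHT measures it around the 160-bit ring):

```python
import hashlib

def sha1_id(data: bytes) -> int:
    return int.from_bytes(hashlib.sha1(data).digest(), "big")

def replicas_for(key: bytes, node_ids, r=3):
    """The r nodes whose IDs are closest to the key's hash act as its replicas."""
    k = sha1_id(key)
    return sorted(node_ids, key=lambda nid: abs(nid - k))[:r]

node_ids = [sha1_id(str(i).encode()) for i in range(8)]   # illustrative node IDs
print([hex(n)[:8] for n in replicas_for(b"photo:1234", node_ids)])
```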

24 OpenDHT Graph Structure
Overlay neighbors match prefixes of local identifier Choose among nodes with same matching prefix length by network latency Diagram: overlay links to nodes with IDs 0x41, 0x84, 0xC0, and 0xED. Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab
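A small sketch of that neighbor-selection rule (the hex IDs and latencies are invented): for each matching-prefix length, keep the lowest-latency candidate.

```python
def matching_prefix_len(a: str, b: str) -> int:
    """Number of leading hex digits the two IDs share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

local_id = "c0a1"
# candidate id -> measured round-trip latency in ms (illustrative)
candidates = {"c0f3": 80, "c09b": 15, "c4d2": 40, "41ee": 25}

# For each prefix length, keep the lowest-latency candidate with that match length.
table = {}
for cid, latency in candidates.items():
    p = matching_prefix_len(local_id, cid)
    if p not in table or latency < table[p][1]:
        table[p] = (cid, latency)

print(table)   # e.g. {2: ('c09b', 15), 1: ('c4d2', 40), 0: ('41ee', 25)}
```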

25 Performing Gets in OpenDHT
Client sends a get request to gateway Gateway routes it along neighbor links to first replica encountered Replica sends response back directly over IP Diagram: a client sends get(0x6b) to a gateway node (0x41); the request is routed to a replica (near 0x6c), which sends the get response straight back to the client over IP. Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab

26 Fixing the Embarrassing Slowness of OpenDHT on PlanetLab
DHTs: The Hype High availability Each key-value pair replicated on multiple nodes Incremental scalability Need more storage/throughput? Just add more nodes. Low latency Recursive routing, proximity neighbor selection, server selection, etc. Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab

27 Robustness Against Failure
If a neighbor dies, a node routes through its next best one If a replica dies, remaining replicas create a new one to replace it Diagram: a client's request is routed around a failed node via the remaining nodes (0x41, 0x6c, 0xC0). Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab

28 Routing Around Failures
Under churn, neighbors may have failed How to detect failures? Acknowledge each hop Diagram: each forwarding hop toward k is confirmed with an ACK. Sean C. Rhea OpenDHT: A Public DHT Service

29 Routing Around Failures
What if we don’t receive an ACK? Resend through a different neighbor Diagram: after a timeout on one hop, the message is resent toward k through another neighbor. Sean C. Rhea OpenDHT: A Public DHT Service
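A toy sketch of that retry rule (send_and_wait is a stand-in for "forward one hop and wait up to timeout seconds for its ACK"; all names and values here are invented):

```python
def forward_with_fallback(neighbors, key, send_and_wait, timeout, dist):
    """Try neighbors closest-first; on a missing ACK, resend through the next-best one."""
    for nbr in sorted(neighbors, key=lambda n: dist(n, key)):
        if send_and_wait(nbr, key, timeout):      # True means the hop was ACKed in time
            return nbr
    raise RuntimeError("no neighbor acknowledged the message")

# Toy demo: neighbor 40 drops the message, so the sender falls back to 70.
alive = {70, 90}
ok = forward_with_fallback(
    neighbors=[40, 70, 90], key=50,
    send_and_wait=lambda n, k, t: n in alive,
    timeout=1.0, dist=lambda n, k: abs(n - k))
print(ok)   # 70
```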

30 Computing Good Timeouts
What if timeout is too long? It increases put/get latency What if timeout is too short? We get a message explosion Sean C. Rhea OpenDHT: A Public DHT Service

31 Computing Good Timeouts
(LPC) Computing Good Timeouts Three basic approaches to timeouts Safe and static (~5s) Rely on history of observed RTTs (TCP style) Rely on model of RTT based on location Sean C. Rhea OpenDHT: A Public DHT Service
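A sketch of the second option, timeouts computed from a history of observed RTTs; this follows the standard Jacobson/Karels TCP estimator rather than anything specific to OpenDHT, with the usual TCP constants:

```python
class RttEstimator:
    def __init__(self, alpha=0.125, beta=0.25, k=4.0, min_timeout=0.2):
        self.alpha, self.beta, self.k = alpha, beta, k
        self.min_timeout = min_timeout
        self.srtt = None        # smoothed RTT
        self.rttvar = None      # smoothed RTT deviation

    def observe(self, rtt):
        """Fold one measured round-trip time (seconds) into the estimate."""
        if self.srtt is None:
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt

    def timeout(self):
        """Per-hop timeout: smoothed RTT plus a few deviations."""
        if self.srtt is None:
            return 5.0          # fall back to the 'safe and static' choice
        return max(self.min_timeout, self.srtt + self.k * self.rttvar)

est = RttEstimator()
for sample in (0.12, 0.15, 0.11, 0.40):   # illustrative per-hop RTT samples
    est.observe(sample)
print(round(est.timeout(), 3))
```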

32 Computing Good Timeouts
Chord errs on the side of caution Very stable, but gives long lookup latencies Sean C. Rhea OpenDHT: A Public DHT Service

33 Fixing the Embarrassing Slowness of OpenDHT on PlanetLab
(LPC) Timeout results Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab

34 Recovering From Failures
Can’t route around failures forever Will eventually run out of neighbors Must also find new nodes as they join Especially important if they’re our immediate predecessors or successors Diagram: when a new node joins next to us on the ring, part of our old responsibility becomes its new responsibility. Sean C. Rhea OpenDHT: A Public DHT Service

35 Recovering From Failures
Obvious algorithm: reactive recovery When a node stops sending acknowledgements, notify other neighbors of potential replacements Similar techniques for arrival of new nodes Diagram: neighboring nodes A, B, C, and D on the ring. Sean C. Rhea OpenDHT: A Public DHT Service

36 Recovering From Failures
Obvious algorithm: reactive recovery When a node stops sending acknowledgements, notify other neighbors of potential replacements Similar techniques for arrival of new nodes Diagram: after B fails, its neighbors are told "B failed, use D" and "B failed, use A". Sean C. Rhea OpenDHT: A Public DHT Service

37 The Problem with Reactive Recovery
What if B is alive, but network is congested? C still perceives a failure due to dropped ACKs C starts recovery, further congesting network More ACKs likely to be dropped Creates a positive feedback cycle (=BAD) Diagram: C, believing B has failed, sends "B failed, use D" and "B failed, use A" into the already congested network. Sean C. Rhea OpenDHT: A Public DHT Service

38 The Problem with Reactive Recovery
What if B is alive, but network is congested? This was the problem with Pastry Combined with poor congestion control, causes network to partition under heavy churn Diagram: the same "B failed, use D" / "B failed, use A" recovery messages as on the previous slide. Sean C. Rhea OpenDHT: A Public DHT Service

39 OpenDHT: A Public DHT Service
Periodic Recovery Every period, each node sends its neighbor list to each of its neighbors Diagram: node C announces to its neighbors, "my neighbors are A, B, D, and E". Sean C. Rhea OpenDHT: A Public DHT Service

40 OpenDHT: A Public DHT Service
Periodic Recovery Every period, each node sends its neighbor list to each of its neighbors Diagram: the announcement "my neighbors are A, B, D, and E" arriving at the neighbors. Sean C. Rhea OpenDHT: A Public DHT Service

41 OpenDHT: A Public DHT Service
Periodic Recovery Every period, each node sends its neighbor list to each of its neighbors How does this break the feedback loop? Volume of recovery messages is independent of failures Diagram: the same periodic announcement, "my neighbors are A, B, D, and E". Sean C. Rhea OpenDHT: A Public DHT Service

42 OpenDHT: A Public DHT Service
Periodic Recovery Every period, each node sends its neighbor list to each of its neighbors Do we need to send the entire list? No, can send delta from last message Diagram: the same periodic announcement, "my neighbors are A, B, D, and E". Sean C. Rhea OpenDHT: A Public DHT Service

43 OpenDHT: A Public DHT Service
Periodic Recovery Every period, each node sends its neighbor list to each of its neighbors What if we contact only a random neighbor (instead of all neighbors)? Still converges in log(k) rounds (k = num neighbors) Diagram: the same periodic announcement, "my neighbors are A, B, D, and E". Sean C. Rhea OpenDHT: A Public DHT Service
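A toy simulation of the random-neighbor variant (node names and initial neighbor sets are invented): every round each node pushes its neighbor list to one randomly chosen neighbor, so recovery traffic stays fixed no matter how many failures occur.

```python
import random

random.seed(1)
# node -> set of neighbors it currently knows about (illustrative initial state)
known = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

def gossip_round(known):
    """Each node sends its neighbor list (plus itself) to one random neighbor."""
    updates = []
    for node, nbrs in known.items():
        if nbrs:
            target = random.choice(sorted(nbrs))
            updates.append((target, set(nbrs) | {node}))
    for target, info in updates:
        known[target] |= info - {target}   # merge the announcement into the receiver's state

for _ in range(4):
    gossip_round(known)
print({n: sorted(v) for n, v in known.items()})
```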

44 Fixing the Embarrassing Slowness of OpenDHT on PlanetLab
(LPC) Recovery results Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab

45 Fixing the Embarrassing Slowness of OpenDHT on PlanetLab
More key-value stores Two settings in which you can use DHTs DDS in a cluster Bamboo on the open Internet How is “the cloud” (e.g., EC2) different/similar? Cloud is a combination of fast/slow networks Cloud is under a single administrative domain Cloud machines should fail less frequently HyperDex targets this more forgiving environment Sean C. Rhea Fixing the Embarrassing Slowness of OpenDHT on PlanetLab

