Distributed Data Structures for Internet Services
G22.3250-001, Robert Grimm, New York University
(with some slides by Steve Gribble)
Altogether Now: The Three Questions
- What is the problem?
- What is new or different or notable?
- What are the contributions and limitations?
Clusters, Clusters, Clusters
- Let's broaden the goals for cluster-based services
  - Incremental scalability
  - High availability
  - Operational manageability
  - And also data consistency
- But what to do if the data has to be persistent?
  - TACC works best for read-only data
  - Porcupine works best for a limited group of services
    - Email, news, bulletin boards, calendaring
Enter Distributed Data Structures (DDS)
- In-memory, single-site application interface
- Persistent, distributed, replicated implementation
- Clean consistency model
  - Atomic operations (but no transactions)
  - Independent of accessing nodes (functional homogeneity)
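To make that programming model concrete, here is a minimal sketch in Java of what such a single-site hash table interface might look like to service code; the type and method names are illustrative, not the actual DDS API.

```java
// Illustrative only (not the paper's actual API): the single-site interface a
// service sees. Each operation is atomic on its key and is persisted and
// replicated before it returns, but there are no multi-operation transactions.
public interface DistributedHashtable {
    byte[] get(byte[] key);              // read one element atomically
    void put(byte[] key, byte[] value);  // write one element atomically and durably
    boolean remove(byte[] key);          // remove one element atomically
}
```

Because any node can serve any call (functional homogeneity), code written against such an interface needs no knowledge of where the data actually lives.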
DDSs as an Intermediate Design Point
- Relational databases
  - Strong guarantees (ACID)
  - But also high overhead and complexity
  - Logical structure very much independent of physical layout
- Distributed data structures
  - Atomic operations, one-copy equivalence
  - Familiar, frequently used interfaces: hash table, tree, log
- Distributed file systems
  - Weak guarantees (e.g., close-to-open consistency)
  - Low-level interface with little data independence
  - Applications impose structure on directories, files, and bytes
Design Principles
- Separate concerns
  - Service code implements the application
  - Storage management is reusable and recoverable
- Appeal to properties of clusters
  - Generally secure and well-administered
  - Fast network, uninterruptible power
- Design for high throughput and high concurrency
  - Use an event-driven implementation
  - Make it easy to compose components
  - Make it easy to absorb bursts (in event queues)
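As a rough illustration of the event-driven style advocated here, the sketch below shows a component fronted by a bounded event queue; the class name and queue size are made up, and the real bricks differ in detail.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: each component exposes a bounded event queue and a
// single-threaded loop that drains it. Queues make components easy to compose
// and absorb bursts; a full queue is an explicit overload signal.
public final class EventStage implements Runnable {
    private final BlockingQueue<Runnable> events = new ArrayBlockingQueue<>(4096);

    public boolean enqueue(Runnable event) {
        return events.offer(event);      // non-blocking: false means "queue full"
    }

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                events.take().run();     // handle one event at a time
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```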
Assumptions
- No network partitions within the cluster
  - Highly redundant network
- DDS components are fail-stop
  - Components implemented to terminate themselves
- Failures are independent
- Messaging is synchronous
  - Bounded time for delivery
- Workload has no extreme hotspots (for the hash table)
  - Population density over the key space is even
  - Working set of hot keys is larger than the number of cluster nodes
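A toy illustration of why the hotspot assumption matters; this is hypothetical code, not the DDS's actual placement scheme. If keys are spread across bricks by hashing, per-brick load only evens out when requests touch many distinct keys, so a working set of hot keys smaller than the number of bricks leaves some bricks idle and others saturated.

```java
import java.util.Arrays;

// Hypothetical illustration, not the DDS's real partitioning: keys are assigned
// to bricks by hashing. Load balances only across many distinct keys; a single
// hot key always lands on the same brick, no matter how many bricks exist.
final class KeyPlacement {
    static int brickFor(byte[] key, int numBricks) {
        return Math.floorMod(Arrays.hashCode(key), numBricks);
    }
}
```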
Distributed Hash Tables (in a Cluster…)
DHT Architecture
Cluster-Wide Metadata Structures
Metadata Maps
- Why is two-phase commit acceptable for DDSs?
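For reference, a bare-bones sketch of a two-phase commit over the replicas of one partition (hypothetical names; a real coordinator also logs its decision and handles its own failure). Given the assumptions above, with no network partitions, fail-stop bricks, and a fast network, the window in which replicas sit prepared and blocked stays short, which is the usual argument for why 2PC is acceptable inside a cluster.

```java
import java.util.List;

// Hypothetical sketch of a coordinator applying one write to all replicas of a
// partition. A real coordinator also logs its decision and recovers from its
// own failure; only the two phases are shown here.
final class TwoPhaseWrite {
    interface Replica {
        boolean prepare(byte[] key, byte[] value); // phase 1: stage the write, vote yes/no
        void commit(byte[] key);                   // phase 2: make the staged write visible
        void abort(byte[] key);                    // phase 2: discard the staged write
    }

    static boolean write(List<Replica> replicas, byte[] key, byte[] value) {
        for (Replica r : replicas) {
            if (!r.prepare(key, value)) {          // any "no" vote (or timeout) aborts
                for (Replica x : replicas) x.abort(key);
                return false;
            }
        }
        for (Replica r : replicas) r.commit(key);  // unanimous yes: commit everywhere
        return true;
    }
}
```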
Recovery
Experimental Evaluation
- Cluster of 28 2-way SMPs and 38 4-way SMPs
  - 208 Pentium CPUs at 500 MHz in total
  - 2-way SMPs: 500 MB RAM, 100 Mb/s switched Ethernet
  - 4-way SMPs: 1 GB RAM, 1 Gb/s switched Ethernet
- Implementation written in Java
  - Sun's JDK 1.1.7v3, OpenJIT, Linux user-level threads
- Load generators run within the cluster
  - 80 nodes necessary to saturate 128 storage bricks
Scalability: Reads and Writes
Graceful Degradation (Reads)
Unexpected Imbalance (Writes)
- What's going on?
Capacity
Recovery Behavior
- Annotated phases in the throughput trace: 1 brick fails, recovery, GC in action, buffer cache warm-up, normal
So, All Is Good?
Assumptions Considered Harmful!
- Central insight, based on experience with DDS
  - "Any system that attempts to gain robustness solely through precognition is prone to fragility"
- In other words
  - Complex systems are so complex that they are impossible to understand completely, especially when operating outside their expected range
Assumptions in Action
- Bounded synchrony (see the sketch after this list)
  - Timeout four orders of magnitude higher than the common-case round-trip time
  - But garbage collection may take a very long time
  - The result is a catastrophic drop in throughput
- Independent failures
  - Race condition in two-phase commit caused a latent memory leak (10 KB/minute under normal operation)
  - All bricks failed predictably within 10-20 minutes of each other
    - After all, they were started at about the same time
  - The result is a catastrophic loss of data
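The bounded-synchrony failure is easy to picture with a timeout-based failure detector, sketched below under assumed numbers (common-case round trip on the order of 0.1 ms, timeout around 1 s); this is not the DDS's actual detector, just the shape of the problem: a stop-the-world GC pause longer than the timeout makes a healthy brick look dead.

```java
// Hypothetical sketch, not the DDS's actual failure detector: a peer is
// suspected dead when no heartbeat has arrived within the timeout. A garbage
// collection pause longer than the timeout trips this even on a healthy peer.
final class FailureDetector {
    private final long timeoutMillis;  // e.g., ~1 s when the common round trip is ~0.1 ms
    private volatile long lastHeartbeatMillis = System.currentTimeMillis();

    FailureDetector(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    void onHeartbeat() { lastHeartbeatMillis = System.currentTimeMillis(); }

    boolean peerSuspected() {
        return System.currentTimeMillis() - lastHeartbeatMillis > timeoutMillis;
    }
}
```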
Assumptions in Action (cont.)
- Fail-stop components
  - Session layer uses the synchronous connect() method
  - Another graduate student adds a firewalled machine to the cluster, resulting in nodes locking up for 15 minutes at a time
  - The result is a catastrophic corruption of data
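The obvious repair, sketched below in modern Java (the timeout-taking connect() did not exist in the JDK used at the time), is to bound connection setup explicitly so an unreachable, firewalled peer fails fast instead of stalling the session layer.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch of bounding connection setup: connect() is given an explicit timeout,
// so a silently dropped connection attempt (e.g., to a firewalled peer) raises
// an exception quickly instead of blocking the caller for minutes.
final class BoundedConnect {
    static Socket open(String host, int port, int timeoutMillis) throws IOException {
        Socket socket = new Socket();
        socket.connect(new InetSocketAddress(host, port), timeoutMillis);
        return socket;
    }
}
```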
What Can We Do?
- Systematically overprovision the system
  - But doesn't that mean predicting the future, again?
- Use admission control (see the sketch after this list)
  - But this can still result in livelock, only later…
- Build introspection into the system
  - Need to easily quantify behavior in order to adapt
- Close the control loop
  - Make the system adapt automatically (but see the previous point)
- Plan for failures
  - Use transactions, checkpoint frequently, reboot proactively
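As one concrete example of the admission-control point, here is a minimal sketch (hypothetical class, not from the paper) that caps in-flight requests and sheds the excess immediately; as the slide warns, this only delays trouble unless capacity and timeouts are also handled.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of admission control: cap the number of in-flight
// requests and reject the rest up front, rather than queueing work the
// system cannot finish in time.
public final class AdmissionGate {
    private final Semaphore slots;

    public AdmissionGate(int maxInFlight) {
        this.slots = new Semaphore(maxInFlight);
    }

    public boolean tryAdmit() { return slots.tryAcquire(); }  // false: shed load now
    public void release()     { slots.release(); }            // call when a request finishes
}
```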
What Do You Think?