Distributed Data Structures for Internet Services
G22.3250-001, Robert Grimm, New York University
(with some slides by Steve Gribble)
Altogether Now: The Three Questions
- What is the problem?
- What is new or different or notable?
- What are the contributions and limitations?
Clusters, Clusters, Clusters
- Let's broaden the goals for cluster-based services
  - Incremental scalability
  - High availability
  - Operational manageability
  - And also data consistency
- But what to do if the data has to be persistent?
  - TACC works best for read-only data
  - Porcupine works best for a limited group of services
    - Email, news, bulletin boards, calendaring
Enter Distributed Data Structures (DDS)
- In-memory, single-site application interface
- Persistent, distributed, replicated implementation
- Clean consistency model
  - Atomic operations (but no transactions)
  - Independent of accessing nodes (functional homogeneity)
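To make that programming model concrete, here is a minimal sketch in Java of what such a single-site hash table interface might look like to service code; the type and method names are illustrative, not the actual DDS API.

```java
// Illustrative only (not the paper's actual API): the single-site interface a
// service sees. Each operation is atomic on its key and is persisted and
// replicated before it returns, but there are no multi-operation transactions.
public interface DistributedHashtable {
    byte[] get(byte[] key);              // read one element atomically
    void put(byte[] key, byte[] value);  // write one element atomically and durably
    boolean remove(byte[] key);          // remove one element atomically
}
```

Because any node can serve any call (functional homogeneity), code written against such an interface needs no knowledge of where the data actually lives.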
DDSs as an Intermediate Design Point
- Relational databases
  - Strong guarantees (ACID)
  - But also high overhead and complexity
  - Logical structure very much independent of physical layout
- Distributed data structures
  - Atomic operations, one-copy equivalence
  - Familiar, frequently used interfaces: hash table, tree, log
- Distributed file systems
  - Weak guarantees (e.g., close-to-open consistency)
  - Low-level interface with little data independence
  - Applications impose structure on directories, files, and bytes
Design Principles
- Separate concerns
  - Service code implements the application
  - Storage management is reusable and recoverable
- Appeal to properties of clusters
  - Generally secure and well-administered
  - Fast network, uninterruptible power
- Design for high throughput and high concurrency
  - Use an event-driven implementation
  - Make it easy to compose components
  - Make it easy to absorb bursts (in event queues)
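As a rough illustration of the event-driven style advocated here, the sketch below shows a component fronted by a bounded event queue; the class name and queue size are made up, and the real bricks differ in detail.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: each component exposes a bounded event queue and a
// single-threaded loop that drains it. Queues make components easy to compose
// and absorb bursts; a full queue is an explicit overload signal.
public final class EventStage implements Runnable {
    private final BlockingQueue<Runnable> events = new ArrayBlockingQueue<>(4096);

    public boolean enqueue(Runnable event) {
        return events.offer(event);      // non-blocking: false means "queue full"
    }

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                events.take().run();     // handle one event at a time
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```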
Assumptions
- No network partitions within the cluster
  - Highly redundant network
- DDS components are fail-stop
  - Components implemented to terminate themselves
- Failures are independent
- Messaging is synchronous
  - Bounded time for delivery
- Workload has no extreme hotspots (for the hash table)
  - Population density over the key space is even
  - Working set of hot keys is larger than the number of cluster nodes
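A toy illustration of why the hotspot assumption matters; this is hypothetical code, not the DDS's actual placement scheme. If keys are spread across bricks by hashing, per-brick load only evens out when requests touch many distinct keys, so a working set of hot keys smaller than the number of bricks leaves some bricks idle and others saturated.

```java
import java.util.Arrays;

// Hypothetical illustration, not the DDS's real partitioning: keys are assigned
// to bricks by hashing. Load balances only across many distinct keys; a single
// hot key always lands on the same brick, no matter how many bricks exist.
final class KeyPlacement {
    static int brickFor(byte[] key, int numBricks) {
        return Math.floorMod(Arrays.hashCode(key), numBricks);
    }
}
```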
Distributed Hash Tables (in a Cluster…)
DHT Architecture
Cluster-Wide Metadata Structures
Metadata Maps
- Why is two-phase commit acceptable for DDSs?
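For reference, a bare-bones sketch of a two-phase commit over the replicas of one partition (hypothetical names; a real coordinator also logs its decision and handles its own failure). Given the assumptions above, with no network partitions, fail-stop bricks, and a fast network, the window in which replicas sit prepared and blocked stays short, which is the usual argument for why 2PC is acceptable inside a cluster.

```java
import java.util.List;

// Hypothetical sketch of a coordinator applying one write to all replicas of a
// partition. A real coordinator also logs its decision and recovers from its
// own failure; only the two phases are shown here.
final class TwoPhaseWrite {
    interface Replica {
        boolean prepare(byte[] key, byte[] value); // phase 1: stage the write, vote yes/no
        void commit(byte[] key);                   // phase 2: make the staged write visible
        void abort(byte[] key);                    // phase 2: discard the staged write
    }

    static boolean write(List<Replica> replicas, byte[] key, byte[] value) {
        for (Replica r : replicas) {
            if (!r.prepare(key, value)) {          // any "no" vote (or timeout) aborts
                for (Replica x : replicas) x.abort(key);
                return false;
            }
        }
        for (Replica r : replicas) r.commit(key);  // unanimous yes: commit everywhere
        return true;
    }
}
```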
Recovery
Experimental Evaluation
- Cluster of 28 2-way SMPs and 38 4-way SMPs
  - 208 Pentium CPUs at 500 MHz in total
  - 2-way SMPs: 500 MB RAM, 100 Mb/s switched Ethernet
  - 4-way SMPs: 1 GB RAM, 1 Gb/s switched Ethernet
- Implementation written in Java
  - Sun's JDK 1.1.7v3, OpenJIT, Linux user-level threads
- Load generators run within the cluster
  - 80 nodes necessary to saturate 128 storage bricks
Scalability: Reads and Writes
Graceful Degradation (Reads)
Unexpected Imbalance (Writes)
- What's going on?
Capacity
Recovery Behavior
- Annotated phases in the throughput trace: 1 brick fails, recovery, GC in action, buffer cache warm-up, normal
So, All Is Good?
Assumptions Considered Harmful!
- Central insight, based on experience with DDS
  - "Any system that attempts to gain robustness solely through precognition is prone to fragility"
- In other words
  - Complex systems are so complex that they are impossible to understand completely, especially when operating outside their expected range
Assumptions in Action
- Bounded synchrony (see the sketch after this list)
  - Timeout four orders of magnitude higher than the common-case round-trip time
  - But garbage collection may take a very long time
  - The result is a catastrophic drop in throughput
- Independent failures
  - Race condition in two-phase commit caused a latent memory leak (10 KB/minute under normal operation)
  - All bricks failed predictably within 10-20 minutes of each other
    - After all, they were started at about the same time
  - The result is a catastrophic loss of data
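The bounded-synchrony failure is easy to picture with a timeout-based failure detector, sketched below under assumed numbers (common-case round trip on the order of 0.1 ms, timeout around 1 s); this is not the DDS's actual detector, just the shape of the problem: a stop-the-world GC pause longer than the timeout makes a healthy brick look dead.

```java
// Hypothetical sketch, not the DDS's actual failure detector: a peer is
// suspected dead when no heartbeat has arrived within the timeout. A garbage
// collection pause longer than the timeout trips this even on a healthy peer.
final class FailureDetector {
    private final long timeoutMillis;  // e.g., ~1 s when the common round trip is ~0.1 ms
    private volatile long lastHeartbeatMillis = System.currentTimeMillis();

    FailureDetector(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    void onHeartbeat() { lastHeartbeatMillis = System.currentTimeMillis(); }

    boolean peerSuspected() {
        return System.currentTimeMillis() - lastHeartbeatMillis > timeoutMillis;
    }
}
```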
Assumptions in Action (cont.)
- Fail-stop components
  - Session layer uses the synchronous connect() method
  - Another graduate student adds a firewalled machine to the cluster, resulting in nodes locking up for 15 minutes at a time
  - The result is a catastrophic corruption of data
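The obvious repair, sketched below in modern Java (the timeout-taking connect() did not exist in the JDK used at the time), is to bound connection setup explicitly so an unreachable, firewalled peer fails fast instead of stalling the session layer.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch of bounding connection setup: connect() is given an explicit timeout,
// so a silently dropped connection attempt (e.g., to a firewalled peer) raises
// an exception quickly instead of blocking the caller for minutes.
final class BoundedConnect {
    static Socket open(String host, int port, int timeoutMillis) throws IOException {
        Socket socket = new Socket();
        socket.connect(new InetSocketAddress(host, port), timeoutMillis);
        return socket;
    }
}
```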
What Can We Do?
- Systematically overprovision the system
  - But doesn't that mean predicting the future, again?
- Use admission control (see the sketch after this list)
  - But this can still result in livelock, only later…
- Build introspection into the system
  - Need to easily quantify behavior in order to adapt
- Close the control loop
  - Make the system adapt automatically (but see the previous point)
- Plan for failures
  - Use transactions, checkpoint frequently, reboot proactively
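As one concrete example of the admission-control point, here is a minimal sketch (hypothetical class, not from the paper) that caps in-flight requests and sheds the excess immediately; as the slide warns, this only delays trouble unless capacity and timeouts are also handled.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of admission control: cap the number of in-flight
// requests and reject the rest up front, rather than queueing work the
// system cannot finish in time.
public final class AdmissionGate {
    private final Semaphore slots;

    public AdmissionGate(int maxInFlight) {
        this.slots = new Semaphore(maxInFlight);
    }

    public boolean tryAdmit() { return slots.tryAcquire(); }  // false: shed load now
    public void release()     { slots.release(); }            // call when a request finishes
}
```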
What Do You Think?