CS294, YelickDataStructs, p1 CS Distributed Data Structures
Agenda
Overview
Interface Issues
Implementation Techniques
Fault Tolerance
Performance
Overview
Distributed data structures are an obvious abstraction for distributed systems. Right?
What do you want to hide within one?
– Data layout?
– When communication is required?
– Number and location of replicas
– Load balancing
Distributed Data Structures
Most of these are containers
Two fundamentally different kinds:
– Those with iterators or the ability to look at all container elements
    Arrays, meshes, databases*, graphs* and trees* (sometimes)
– Those with only single-element ops
    Queue, directory (hash table or tree), all *’d items above
DDS in Ninja
Described in Gribble, Brewer, Hellerstein, Culler
A distributed data structure (DDS) is a self-managing layer for persistent data.
– High availability, concurrency, consistency, durability, fault tolerance, scalability
A distributed hash table is an example
– Uses two-phase commits for consistency
– Partitioning for scalability
Scheduling Structures
In serial code, most scheduling is done with a stack (often implicit), a FIFO queue, or a priority queue
Do all of these make sense in a distributed setting?
Are there others?
Distributed Queues
Load balancing (work stealing…)
– Push new work onto a stack
– Execute locally by popping from the stack
– Steal remotely by removing from the bottom of the stack (FIFO: the oldest work is stolen first)
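The three operations above can be sketched as a double-ended queue. This is a minimal single-process illustration, not a distributed implementation: the class name `WorkStealingDeque` and its method names are assumptions made for the example, and a single lock stands in for whatever synchronization or messaging a real system would use.

```python
from collections import deque
from threading import Lock


class WorkStealingDeque:
    """Per-worker task pool: the owner works at the top, thieves at the bottom."""

    def __init__(self):
        self._items = deque()
        self._lock = Lock()  # one coarse lock keeps the sketch simple, not fast

    def push(self, task):
        """Owner: push new work onto the top of the stack."""
        with self._lock:
            self._items.append(task)

    def pop(self):
        """Owner: take the newest task (LIFO), or None when empty."""
        with self._lock:
            return self._items.pop() if self._items else None

    def steal(self):
        """Thief: take the oldest task from the bottom (FIFO), or None when empty."""
        with self._lock:
            return self._items.popleft() if self._items else None


local = WorkStealingDeque()
for task in ["a", "b", "c"]:
    local.push(task)
stolen = local.steal()  # a remote worker takes the oldest task
popped = local.pop()    # the owner takes the newest task
```

The owner and the thief touch opposite ends, so they only contend when the pool is nearly empty.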
Interfaces (1)
Blocking atomic interfaces: operations happen between invocation and return
– Internally each operation performs locking or another form of synchronization
Non-blocking “atomic” interfaces: operation happens sometime after invocation
– Often paired with completion synchronization
    Request/response for each operation
    Wait for all “my” operations to complete
    Wait for all operations in the world to complete
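The difference between the two interface styles can be illustrated with a small sketch. The `Table` class and its `put`, `put_async`, `get`, and `close` methods are hypothetical names chosen for this example; a local thread pool stands in for the remote nodes that would perform the operation in a real system.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from threading import Lock


class Table:
    """Hypothetical key/value container offering both interface styles."""

    def __init__(self):
        self._data = {}
        self._lock = Lock()
        self._pool = ThreadPoolExecutor(max_workers=4)

    def put(self, key, value):
        """Blocking atomic: the update has taken effect by the time this returns."""
        with self._lock:
            self._data[key] = value

    def put_async(self, key, value):
        """Non-blocking: returns a Future; the update happens some time later."""
        return self._pool.submit(self.put, key, value)

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def close(self):
        """Wait for every outstanding operation on this table, then stop the workers."""
        self._pool.shutdown(wait=True)


table = Table()
table.put("x", 1)                # visible as soon as the call returns
first = table.put_async("y", 2)
second = table.put_async("z", 3)
first.result()                   # completion of one specific operation
wait([first, second])            # completion of all "my" operations
table.close()                    # completion of everything issued on this table
```

The last three calls correspond loosely to the three completion styles listed on the slide; the global one is only approximated here, since a single process cannot see operations issued elsewhere.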
Interfaces (2)
Non-atomic interface: use external synchronization
– Undefined under certain kinds of (or all) concurrency
– May be paired with bracketing synchronization
    Acquire-insert-lock, insert, insert, Release-insert-lock
    Begin-transaction…
Operations with no semantics (no-ops)
– Prefetch, Flush copies, …
Operations that allow for failures
– Signal “failed”
DDS Interfaces
Contrast:
– RDBMSs provide ACID semantics on transactions
– Distributed file systems: NFS weak, Frangipani and AFS stronger
DDS:
– All operations on elements are atomic (indivisible, all or nothing)
    This seems to mean that the hash table operations that involve a single element are atomic
– One-copy equivalence: replication of elements is invisible
– No transactions across elements or operations
Implementation Strategies (1)
Two simple techniques
– Partitioning:
    Used when the data structure is large
    Used when writes/updates are frequent
– Replication:
    Used when writes are infrequent and reads are very frequent
    Used to tolerate failures
    Full static replication is extreme; dynamic partial replication is more common
Many hybrids and variations
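One way to see how the two techniques combine is a placement function that first assigns a key to a partition and then assigns each partition to several nodes. The sketch below is an illustration only: the function names, the SHA-256 hash, and the "consecutive nodes" placement rule are all assumptions, not something the slides prescribe.

```python
import hashlib


def partition_of(key: str, num_partitions: int) -> int:
    """Partitioning: map a key to one partition with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def replicas_of(partition: int, nodes: list, copies: int) -> list:
    """Replication: place `copies` replicas on consecutive nodes (assumed rule)."""
    return [nodes[(partition + i) % len(nodes)] for i in range(copies)]


nodes = ["n0", "n1", "n2", "n3"]
p = partition_of("user:42", num_partitions=4)
group = replicas_of(p, nodes, copies=2)          # hybrid: partitioned and replicated
only_partitioned = replicas_of(p, nodes, copies=1)
fully_replicated = replicas_of(p, nodes, copies=len(nodes))
```

Setting `copies` to 1 gives pure partitioning and setting it to the number of nodes gives full static replication, the two extremes named on the slide.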
Implementation Strategies (2)
Moving data to computation is good for:
– dynamic load balancing
    I.e., idle processors grab work
– smaller objects in ops involving > 1 object
Moving computation to data is good for:
– large data structures
Other?
DDS: Distributed Hash Table
Operations include:
– Create, Destroy
– Put, Get, and Remove
Built with storage “bricks”
– Each manages a single-node, network-visible hash table
– Each contains a buffer cache, lock manager, network stubs and skeletons
Data is partitioned, and partitions are replicated
– Replica groups are used for each partition
DDS: Distributed Hash Table
Operations on elements:
– Get: use any replica in the appropriate group
– Put or Remove: update all replicas in the group using two-phase commit
    The DDS library is the commit coordinator
    If an individual node crashes during the commit phase, it is removed from the replica group
    If the DDS library fails during the commit phase, the individual nodes coordinate: if any have committed, all must
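The read and update paths for a single replica group can be sketched as follows. This is a simplified in-process model under stated assumptions, not the Ninja implementation: `Brick`, `ReplicaGroup`, and the `alive` flag are hypothetical names, a crash is only detected in the prepare phase, and recovery from a failed coordinator is not modeled at all.

```python
import random

_REMOVE = object()  # sentinel marking a staged removal


class Brick:
    """Single-node hash table that stages an update before applying it."""

    def __init__(self, name):
        self.name = name
        self.table = {}
        self.staged = {}
        self.alive = True  # set to False to simulate a crash

    def prepare(self, txid, key, value):
        """Phase 1: stage the update and vote yes; a crashed brick never answers."""
        if not self.alive:
            return False
        self.staged[txid] = (key, value)
        return True

    def commit(self, txid):
        """Phase 2: apply the staged update."""
        key, value = self.staged.pop(txid)
        if value is _REMOVE:
            self.table.pop(key, None)
        else:
            self.table[key] = value


class ReplicaGroup:
    """Plays the commit coordinator role for one partition."""

    def __init__(self, bricks):
        self.bricks = list(bricks)
        self._next_txid = 0

    def get(self, key):
        # Any replica will do, because updates reach every group member.
        return random.choice(self.bricks).table.get(key)

    def put(self, key, value):
        return self._update(key, value)

    def remove(self, key):
        return self._update(key, _REMOVE)

    def _update(self, key, value):
        txid = self._next_txid
        self._next_txid += 1
        prepared = [b for b in self.bricks if b.prepare(txid, key, value)]
        self.bricks = prepared  # unresponsive bricks leave the group
        for brick in prepared:
            brick.commit(txid)
        return bool(prepared)


b1, b2, b3 = Brick("brick1"), Brick("brick2"), Brick("brick3")
group = ReplicaGroup([b1, b2, b3])
group.put("k", "v1")
b3.alive = False      # simulate a crash before the next update
group.put("k", "v2")  # brick3 is dropped; the update still succeeds
```

Dropping the crashed brick rather than aborting the update is what keeps the remaining replicas identical, so a later `get` can still read from any member of the group.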
DDS: Hash Table
[Figure labels: Key, DP map, RG map]

RG name   RG members
000       dds1, dds2
100       dds2
10        dds5, dds4
01        dds7
011       dds5, dds3
111       dds2
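A lookup against the RG map on this slide might be sketched as below. The sketch assumes that an RG name is matched against the low-order bits of the key; that is the only reading under which the six names are unambiguous and cover every key, but the slide itself does not say so. It also folds the DP map step into a plain suffix match rather than modeling it as a separate structure, and the example key is made up.

```python
# RG map copied from the slide: replica group name -> member bricks.
RG_MAP = {
    "000": ["dds1", "dds2"],
    "100": ["dds2"],
    "10": ["dds5", "dds4"],
    "01": ["dds7"],
    "011": ["dds5", "dds3"],
    "111": ["dds2"],
}


def replica_group_for(key_bits: str):
    """Return (RG name, members) for the one RG name that is a suffix of key_bits."""
    matches = [name for name in RG_MAP if key_bits.endswith(name)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one replica group, found {matches}")
    name = matches[0]
    return name, RG_MAP[name]


# Hypothetical 8-bit key whose low-order bits are 011.
name, members = replica_group_for("10110011")
```

With this key the lookup selects the group named 011, whose members in the table are dds5 and dds3.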
Example: Aleph Directory
Maps names to mobile objects
– Files, locks (?), processes, …
Interested in performance at scale, not reliability
Two basic protocols:
– Home: each object has a fixed “home” PE that keeps track of cache copies
– Arrow: based on the path-reversal idea
Path Reversal
[Two figure slides; the first is labeled “Find”]
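Path reversal can be illustrated with a sequential simulation. The sketch assumes the usual formulation of the arrow idea: nodes sit on a fixed spanning tree, every node keeps one pointer toward the current owner, and a find request flips each pointer it crosses so that it points back toward the requester. The class name `ArrowDirectory` and its fields are made up for the example, and concurrent requests are not modeled.

```python
class ArrowDirectory:
    """Sequential model of path reversal on a fixed spanning tree."""

    def __init__(self, tree_edges, owner):
        neighbors = {}
        for a, b in tree_edges:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
        # link[v] == v at the owner; elsewhere it is the tree neighbor toward the owner.
        self.link = {owner: owner}
        stack = [owner]
        while stack:
            u = stack.pop()
            for v in neighbors[u]:
                if v not in self.link:
                    self.link[v] = u
                    stack.append(v)

    def find(self, requester):
        """Route a request to the owner; return (previous owner, messages sent)."""
        if self.link[requester] == requester:
            return requester, 0
        prev, cur = requester, self.link[requester]
        self.link[requester] = requester  # the requester becomes the new end of the path
        hops = 1
        while True:
            nxt = self.link[cur]
            self.link[cur] = prev         # reverse the arrow toward the requester
            if nxt == cur:                # cur was the end of the path: the owner
                return cur, hops
            prev, cur = cur, nxt
            hops += 1


# A four-node path a - b - c - d with the object initially at d.
directory = ArrowDirectory([("a", "b"), ("b", "c"), ("c", "d")], owner="d")
first = directory.find("a")   # travels a -> b -> c -> d
second = directory.find("c")  # travels c -> b -> a
```

After each find, every arrow points toward the most recent requester, so the next request is routed there without consulting a central home node.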
Aleph Directory Performance
Aleph is implemented as Java packages on top of RMI (and UDP?)
Run on small systems (up to 16 nodes)
– Assumed that the centralized “home” solution would be faster at this scale
    2 messages to request; 2 to retrieve
– Arrow was actually faster
    log_2 p messages to request; 1 to retrieve
    In practice, only 2 to request (counter ex.)
Hybrid Directory Protocol
Essentially the same as the “home” protocol, except
Link waiting processors into a chain (across the processors)
– Each keeps the id of the processor ahead of it in the chain
Under high contention, the resource moves down the chain
Performance:
– Faster than home and arrow on the counter benchmark and some others…
How Many Data Structures?
Gribble et al. claim:
– “We believe that given a small set of DDS types (such as a hash table, a tree, and an administrative log), authors will be able to build a large class of interesting and sophisticated servers.”
– Do you believe this?
– What does it imply about tools vs. libraries?
Administrivia
Gautam Kar and Joe L. Hellerstein speaking Thursday
– Papers online
– Contact me about meeting with them
Final projects:
– Send mail to schedule meeting with me
Next week:
– Tuesday: guest lecture by Aaron Brown on benchmarks; related to Kar and Hellerstein work.
– Still to come: Gray, Lamport, and Liskov