Atlas: An Infrastructure for Global Computing
People Eric Baldeschwieler (UC Berkeley) Bobby Blumofe (UT Austin) Eric Brewer (UC Berkeley)
Outline Introduction Programming model Architecture Examples Discussion Limitations & Conclusion
Introduction Properties of an Internet computing infrastructure Scalability: to 10^6 nodes Heterogeneity: of machines & OSs Fault tolerance: completion probability comparable to sequential program Adaptive parallelism: dynamic set of resources
Properties ... Safety: Hosts must be secure Anonymity: Secure privacy of client: data & program Hierarchy: Locality of communication (local bandwidth typically is higher) Ease of use: Minimize “costs” of participating. Reasonable performance: Low overhead Benefit from a small set of machines.
Introduction ... Atlas combines mechanisms from Java and Cilk with new mechanisms. Java “ensures”: heterogeneity, safety
Introduction ... Atlas: extends Cilk’s work-stealing scheduler to a hierarchical Internet setting uses Cilk-NOW’s mechanisms for: adaptive parallelism fault tolerance
Programming Model Applications are written in Java When a native library is used, heterogeneity is limited to platforms that support it. Programming model is: a Java-based implementation of Cilk: Non-blocking, explicit continuation passing threads a Unix-like URL-based file system & local caching with coherence.
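As a rough illustration of non-blocking, explicit continuation-passing threads, here is a minimal single-process Java sketch of Fib. It is not the Atlas API: the names `CpsFib`, `Sum`, `sendArgument`, and `ready` are hypothetical, and the single ready queue stands in for the real scheduler. Each thread runs to completion, and a result is delivered by sending it to a waiting successor closure rather than by blocking on a child.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

/** Hypothetical sketch of continuation-passing threads; not the Atlas API. */
class CpsFib {
    /** Ready queue of threads; each thread runs to completion without blocking. */
    static final Deque<Runnable> ready = new ArrayDeque<>();

    /** Successor closure that becomes ready once both arguments have arrived. */
    static final class Sum {
        private final IntConsumer k; // continuation that receives the sum
        private int missing = 2;     // join counter: arguments still outstanding
        private int total = 0;

        Sum(IntConsumer k) { this.k = k; }

        /** Called by a child thread to deliver its result; never blocks. */
        void sendArgument(int value) {
            total += value;
            if (--missing == 0) {
                final int result = total;
                ready.push(() -> k.accept(result));
            }
        }
    }

    /** A fib thread: sends its result to k, or spawns two children and a successor. */
    static void fib(int n, IntConsumer k) {
        if (n < 2) {
            k.accept(n);
            return;
        }
        Sum sum = new Sum(k);
        ready.push(() -> fib(n - 1, sum::sendArgument));
        ready.push(() -> fib(n - 2, sum::sendArgument));
    }

    /** Drives the ready queue until no thread is left, then returns the result. */
    static int run(int n) {
        int[] out = new int[1];
        ready.push(() -> fib(n, v -> out[0] = v));
        while (!ready.isEmpty()) {
            ready.pop().run();
        }
        return out[0];
    }

    public static void main(String[] args) {
        System.out.println(run(24));
    }
}
```

Because no thread ever waits, any entry in the ready queue can be handed to another worker, which is what makes this style a natural fit for work stealing.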
Architecture Basic architecture [Figure: basic architecture. Labels: Client, Manager, Compute Server (several); software stack: Application (Java), Runtime library, Java interpreter, Native libraries (C or C++).]
Architecture ... The client is a Java application that connects to compute servers on machines other than its manager’s. Idle servers steal work from busy ones.
Architecture Compute server: relinquishes control when there is non-Atlas work (a screensaver?) Runs as a daemon: working pings manager & siblings for work to steal
Architecture: Porting Atlas Requires a Java runtime system. Port: the natively written URL-based file system and some support routines.
Hierarchical Work Stealing [Figure: hierarchy of Managers and Compute Servers.]
Hierarchical Work Stealing ... Manager keeps track of when its subtree is idle. If manager’s subtree is idle, manager steals work from its siblings. If a subtree has “too much” work, it “allows” work stealing from above. What is the definition & implementation of “too much”?
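The stealing rules on this slide can be sketched in Java as follows. This is a toy model under stated assumptions, not the Atlas implementation: the classes `Manager` and `Server`, the methods `steal`, `load`, and `take`, and especially the `TOO_MUCH` threshold are hypothetical, since the slides leave the definition of “too much” open. An idle server first tries servers under its own manager; only when the whole subtree is idle does the search move up to sibling subtrees.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of hierarchical work stealing; names and policy are assumptions. */
class HierSteal {
    /** Assumed meaning of "too much": a subtree holding at least this many tasks. */
    static final int TOO_MUCH = 2;

    static final class Manager {
        final Manager parent;
        final List<Manager> children = new ArrayList<>();
        final List<Server> servers = new ArrayList<>();

        Manager(Manager parent) {
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }

        /** Number of queued tasks anywhere in this manager's subtree. */
        int load() {
            int n = 0;
            for (Server s : servers) n += s.work.size();
            for (Manager m : children) n += m.load();
            return n;
        }

        /** Removes the oldest task of the first busy server in the subtree, or null. */
        String take() {
            for (Server s : servers) {
                if (!s.work.isEmpty()) return s.work.pollLast();
            }
            for (Manager m : children) {
                String t = m.take();
                if (t != null) return t;
            }
            return null;
        }
    }

    static final class Server {
        final Manager manager;
        /** The owner pushes and pops at the head; thieves take from the tail. */
        final Deque<String> work = new ArrayDeque<>();

        Server(Manager manager) {
            this.manager = manager;
            manager.servers.add(this);
        }
    }

    /** An idle server looks for work locally first, then through the hierarchy. */
    static String steal(Server thief) {
        for (Server s : thief.manager.servers) {
            if (s != thief && !s.work.isEmpty()) return s.work.pollLast();
        }
        // Climb only while the subtree below is idle; ask overloaded sibling subtrees.
        for (Manager m = thief.manager; m.parent != null && m.load() == 0; m = m.parent) {
            for (Manager sib : m.parent.children) {
                if (sib != m && sib.load() >= TOO_MUCH) return sib.take();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Manager root = new Manager(null);
        Server idle = new Server(new Manager(root));
        Server busy = new Server(new Manager(root));
        busy.work.push("t1");
        busy.work.push("t2");
        System.out.println(steal(idle));
    }
}
```

Trying local servers before sibling subtrees is what keeps most steals inside one domain, which is the stated goal of localizing communication.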
Hierarchical Work Stealing The authors claim that proven properties of Cilk hold in this hierarchical setting. Goals: Localize communication Sub-trees map to domain hierarchy Administrators can control thread migration: Outflow: Privacy Inflow: Host security
Examples
Fib: fine grained threads
POV-Ray: coarse grained threads

            Base     1 Node   3 Nodes     8 Nodes
Fib (24)    1.3      80       40 (2.0)    31 (2.6)
POV-Ray     20700    21000    -           2700 (7.8)

Numbers in ( ) are speedups over 1-node case.
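The parenthesized speedups can be re-derived as the 1-node time divided by the P-node time. The short Java sketch below does that arithmetic on the numbers from the slide; the class and method names are hypothetical, and the slide does not state the time unit, so the values are treated as plain ratios.

```java
/** Recomputes the slide's speedups as T(1 node) / T(P nodes). */
class Speedup {
    /** Speedup of a P-node run relative to the 1-node run. */
    static double speedup(double oneNodeTime, double pNodeTime) {
        return oneNodeTime / pNodeTime;
    }

    /** Rounds to one decimal place, the precision used on the slide. */
    static double round1(double x) {
        return Math.round(x * 10) / 10.0;
    }

    public static void main(String[] args) {
        System.out.println("Fib(24), 3 nodes: " + round1(speedup(80, 40)));
        System.out.println("Fib(24), 8 nodes: " + round1(speedup(80, 31)));
        System.out.println("POV-Ray, 8 nodes: " + round1(speedup(21000, 2700)));
    }
}
```

Note that these ratios use the 1-node Atlas run as the baseline, not the Base column, so they measure scaling and say nothing about the cost of running under Atlas on a single node.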
Examples ... POV-Ray is not written in Java Partitioning is done in Java 8 nodes: only 2% overhead. What about larger P?
Discussion Scalable: Yes. Heterogeneity: Incomplete until Atlas divorces itself from all native libraries. Safety: Java: OK. Native libraries: ?
Discussion ... Fault tolerance: A timed out thread is recomputed from a checkpoint maintained by subtree (manager?) What is the effect on performance of checkpointing? Subtree rooted at a thread is its subcomputation.
Fault Tolerance ... Subcomputations are transactions: Authors claim: side effects can be undone How does this relate to hierarchical work stealing?
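One way to picture "a timed out thread is recomputed from a checkpoint" is the Java sketch below. It is an assumption-laden toy, not the Atlas or Cilk-NOW mechanism: the class `CheckpointRecovery`, its methods, the heartbeat scheme, and the logical clock passed in as `now` are all hypothetical. Whoever holds the checkpoints records when each outstanding subcomputation was last heard from, and re-queues any that exceed the timeout.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: re-queue a timed-out subcomputation from its checkpoint. */
class CheckpointRecovery {
    /** Saved state sufficient to restart a subcomputation from its root thread. */
    static final class Checkpoint {
        final String id;
        final int input;
        long lastHeard; // logical time of the last sign of life

        Checkpoint(String id, int input, long now) {
            this.id = id;
            this.input = input;
            this.lastHeard = now;
        }
    }

    private final long timeout;
    private final Map<String, Checkpoint> outstanding = new LinkedHashMap<>();
    /** Checkpoints waiting to be handed to another server for recomputation. */
    final Deque<Checkpoint> ready = new ArrayDeque<>();

    CheckpointRecovery(long timeout) {
        this.timeout = timeout;
    }

    /** Records a checkpoint when a subcomputation is handed to a server. */
    void assign(String id, int input, long now) {
        outstanding.put(id, new Checkpoint(id, input, now));
    }

    /** Refreshes the timestamp of a subcomputation that is still alive. */
    void heartbeat(String id, long now) {
        Checkpoint c = outstanding.get(id);
        if (c != null) c.lastHeard = now;
    }

    /** Discards the checkpoint once the result has arrived. */
    void complete(String id) {
        outstanding.remove(id);
    }

    /** Moves every timed-out subcomputation back to the ready queue; returns their ids. */
    List<String> reap(long now) {
        List<String> restarted = new ArrayList<>();
        Iterator<Checkpoint> it = outstanding.values().iterator();
        while (it.hasNext()) {
            Checkpoint c = it.next();
            if (now - c.lastHeard > timeout) {
                it.remove();
                ready.addLast(c);
                restarted.add(c.id);
            }
        }
        return restarted;
    }

    public static void main(String[] args) {
        CheckpointRecovery r = new CheckpointRecovery(10);
        r.assign("a", 1, 0);
        r.assign("b", 2, 0);
        r.heartbeat("a", 8);
        System.out.println(r.reap(12));
    }
}
```

Restarting from the checkpoint is only safe if the lost work left no visible side effects, which is why the slides stress that subcomputations behave like transactions.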
Discussion ... Anonymity: A host executing a stolen subtree cannot determine client. Managers are assumed to be trustworthy Hierarchy: Yes, via manager hierarchy. Ease of use: Interface incomplete. clients submit jobs via a special “shell”
Discussion ... Adaptive parallelism: “Owner” (?) of compute server sets a policy that defines when server is idle. How? When compute server becomes unavailable for Atlas work, all its sub-computations are moved to another compute server.
Adaptive Parallelism ... Moving a subcomputation requires updating information linking subcomputation to its: parent children How long does it take to retreat? Is sub-computation restarted? From checkpoint?
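The bookkeeping involved in moving a subcomputation can be sketched in Java as below. This is a hypothetical in-memory model, not Atlas code: the class `Sub`, its fields, and the `migrate` method are invented for illustration, and host names are plain strings. The point is only that both directions of the link must be updated: the parent's record of where the child lives, and each child's record of where to send its result.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the link updates needed when a subcomputation moves. */
class Migration {
    static final class Sub {
        final String id;
        final Sub parent;
        final List<Sub> children = new ArrayList<>();
        String host;       // server currently running this subcomputation
        String parentHost; // where this subcomputation sends its result
        /** Where each child currently runs, keyed by child id. */
        final Map<String, String> childHost = new HashMap<>();

        Sub(String id, String host, Sub parent) {
            this.id = id;
            this.host = host;
            this.parent = parent;
            if (parent != null) {
                this.parentHost = parent.host;
                parent.children.add(this);
                parent.childHost.put(id, host);
            }
        }
    }

    /** Moves s to newHost and repairs the links held by its parent and children. */
    static void migrate(Sub s, String newHost) {
        s.host = newHost;
        if (s.parent != null) {
            s.parent.childHost.put(s.id, newHost); // parent must find the child
        }
        for (Sub c : s.children) {
            c.parentHost = newHost;                // children must find the parent
        }
    }

    public static void main(String[] args) {
        Sub root = new Sub("root", "h1", null);
        Sub mid = new Sub("mid", "h2", root);
        Sub leaf = new Sub("leaf", "h3", mid);
        migrate(mid, "h9");
        System.out.println(root.childHost.get("mid") + " " + leaf.parentHost);
    }
}
```

In a distributed setting each of these updates is a message, so the cost of a move grows with the number of children the subcomputation has outstanding.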
Limitations Atlas inherits tree-structured program limitation from Cilk. But this is still a rich set! Generalizing to non-tree-structured programs seems hard. No shared variables among threads. Global file system is read-only.
Conclusion Jicos design goals = those for Atlas. Use JXTA to give Jicos a “file system” Then, Jicos becomes Atlas’s heir.