Atlas: An Infrastructure for Global Computing

People
Eric Baldeschwieler (UC Berkeley)
Bobby Blumofe (UT Austin)
Eric Brewer (UC Berkeley)

Outline
Introduction
Programming model
Architecture
Examples
Discussion
Limitations & Conclusion

Introduction: Properties of an Internet computing infrastructure
Scalability: to 10^6 nodes
Heterogeneity: of machines & OSs
Fault tolerance: completion probability comparable to that of a sequential program
Adaptive parallelism: dynamic set of resources

Properties ...
Safety: Hosts must be secure
Anonymity: Secure privacy of client: data & program
Hierarchy: Locality of communication (local bandwidth typically is higher)
Ease of use: Minimize “costs” of participating.
Reasonable performance: Low overhead. Benefit from a small set of machines.

Introduction ...
Atlas combines mechanisms from Java and Cilk with new mechanisms.
Java “ensures”: heterogeneity, safety

Introduction ...
Atlas:
extends Cilk’s work-stealing scheduler to a hierarchical Internet setting
uses Cilk-NOW’s mechanisms for adaptive parallelism and fault tolerance

Programming Model
Applications are written in Java.
When a native library is used, heterogeneity is limited to platforms that support it.
The programming model is:
a Java-based implementation of Cilk: non-blocking threads with explicit continuation passing
a Unix-like, URL-based file system with local caching and coherence.
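
To make "non-blocking threads with explicit continuation passing" concrete, here is a minimal single-process sketch in plain Java. It is not the Atlas API: the class `CpsFib`, the `Sum` join counter, and the method names are all hypothetical, chosen only to illustrate the style. Each thread runs to completion without waiting; instead of returning a value, it sends its result to a continuation supplied by its parent.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

// Hypothetical illustration of Cilk-style continuation passing; not Atlas code.
final class CpsFib {
    // Ready queue of non-blocking threads. Each one runs to completion.
    private final Deque<Runnable> ready = new ArrayDeque<>();

    // Successor that fires once both of its arguments have arrived.
    private static final class Sum {
        private final IntConsumer continuation;
        private int missing = 2;
        private int total = 0;

        Sum(IntConsumer continuation) { this.continuation = continuation; }

        void send(int value) {
            total += value;
            if (--missing == 0) continuation.accept(total);
        }
    }

    // Spawn fib(n). The result goes to continuation k instead of being returned.
    void spawnFib(int n, IntConsumer k) {
        ready.push(() -> {
            if (n < 2) { k.accept(n); return; }
            Sum sum = new Sum(k);           // joins the two children
            spawnFib(n - 1, sum::send);     // children never block the parent
            spawnFib(n - 2, sum::send);
        });
    }

    // Drain the ready queue on one worker and return the final result.
    static int compute(int n) {
        CpsFib scheduler = new CpsFib();
        int[] result = new int[1];
        scheduler.spawnFib(n, v -> result[0] = v);
        while (!scheduler.ready.isEmpty()) scheduler.ready.pop().run();
        return result[0];
    }
}
```

Calling `CpsFib.compute(10)` yields 55. Because no thread ever waits on another, any entry in the ready queue could in principle be handed to a different worker, which is what makes this style a natural fit for work stealing.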

Architecture: Basic architecture
[Figure: basic architecture. Labels: Client, Manager, Compute Server (several). Software layers: Application (Java), Runtime library, Java interpreter, Native libraries (C or C++).]

Architecture ...
Client is a Java application
connects to compute servers on machines other than its manager’s.
Idle servers steal work from busy ones.

Architecture: Compute server
Runs as a daemon.
Relinquishes control when there is non-Atlas work (a screensaver?).
When it has no work, pings its manager & siblings for work to steal.

Architecture: Porting Atlas
A Java runtime system
Port: the natively written URL-based file system and some support routines.

Hierarchical Work Stealing
[Figure: hierarchy of Managers and Compute Servers.]

Hierarchical Work Stealing ...
Manager keeps track of when its subtree is idle.
If a manager’s subtree is idle, the manager steals work from its siblings.
If a subtree has “too much” work, it “allows” work stealing from above.
What is the definition & implementation of “too much”?
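
The stealing rule above can be sketched as a walk up the tree. This is a hypothetical model, not Atlas code: `StealNode`, `surrender`, and `steal` are invented names, tasks are plain strings, and the sketch sidesteps the open “too much” question by letting any non-empty subtree be stolen from.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of a manager/compute-server tree; names are not from Atlas.
final class StealNode {
    final String name;
    final StealNode parent;                               // null for the root manager
    final List<StealNode> children = new ArrayList<>();   // empty for compute servers
    final Deque<String> work = new ArrayDeque<>();        // tasks queued on a compute server

    StealNode(String name, StealNode parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) parent.children.add(this);
    }

    // Give up the oldest task of the first non-empty server in this subtree,
    // or return null if the whole subtree is idle.
    String surrender() {
        if (!work.isEmpty()) return work.pollFirst();
        for (StealNode child : children) {
            String task = child.surrender();
            if (task != null) return task;
        }
        return null;
    }

    // Called on an idle node: try siblings first, and escalate one level
    // only when every sibling subtree at the current level is idle too.
    String steal() {
        for (StealNode n = this; n.parent != null; n = n.parent) {
            for (StealNode sibling : n.parent.children) {
                if (sibling == n) continue;
                String task = sibling.surrender();
                if (task != null) return task;
            }
        }
        return null; // nothing to steal anywhere in the tree
    }
}
```

With two managers under one root, an idle server under the first manager takes work from a sibling server before it reaches into the second manager's subtree, which is the communication-locality goal of the hierarchy.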

Hierarchical Work Stealing
The authors claim that proven properties of Cilk hold in this hierarchical setting.
Goals:
Localize communication
Sub-trees map to domain hierarchy
Administrators can control thread migration: Outflow: Privacy. Inflow: Host security.

Examples Fib: fine grained threads POV-Ray: coarse grained threads
Base 1 Node 3 Nodes 8 Nodes Fib (24) (2.0) 31 (2.6) POV-Ray (7.8) Numbers in ( ) are speedups over 1-node case.

Examples ...
POV-Ray is not written in Java.
Partitioning is done in Java.
8 nodes: only 2% overhead.
What about larger P?

Discussion
Scalable: Yes.
Heterogeneity: Incomplete until Atlas divorces itself from all native libraries.
Safety: Java: OK. Native libraries: ?

Discussion ...
Fault tolerance: A timed-out thread is recomputed from a checkpoint maintained by its subtree (manager?).
What is the effect of checkpointing on performance?
The subtree rooted at a thread is its subcomputation.

Fault Tolerance ...
Subcomputations are transactions.
Authors claim: side effects can be undone.
How does this relate to hierarchical work stealing?
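
One way to picture "a timed-out thread is recomputed from a checkpoint" is the sketch below. Everything in it is an assumption made for illustration: the class `CheckpointingManager`, its methods, the string-valued checkpoints, and the heartbeat-based timeout are not taken from Atlas or Cilk-NOW. The caller passes in the clock so that the behavior is deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical checkpoint table; not the Atlas or Cilk-NOW mechanism.
final class CheckpointingManager {
    private final long timeoutMillis;
    private final Map<String, String> checkpoints = new LinkedHashMap<>(); // thread id -> saved state
    private final Map<String, Long> lastHeard = new LinkedHashMap<>();     // thread id -> last sign of life

    CheckpointingManager(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    // Record a checkpoint when a thread is handed to a compute server.
    void assign(String threadId, String savedState, long now) {
        checkpoints.put(threadId, savedState);
        lastHeard.put(threadId, now);
    }

    // A running thread reports that it is still alive.
    void heartbeat(String threadId, long now) {
        if (lastHeard.containsKey(threadId)) lastHeard.put(threadId, now);
    }

    // A finished thread no longer needs its checkpoint.
    void complete(String threadId) {
        checkpoints.remove(threadId);
        lastHeard.remove(threadId);
    }

    // Return the saved state of every thread that has been silent for longer
    // than the timeout, and restart its clock so it can be reassigned.
    List<String> recoverTimedOut(long now) {
        List<String> toRecompute = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeard.entrySet()) {
            if (now - entry.getValue() > timeoutMillis) {
                toRecompute.add(checkpoints.get(entry.getKey()));
                entry.setValue(now);
            }
        }
        return toRecompute;
    }
}
```

The sketch recomputes from the saved state only. It does not model undoing side effects, which the transactional claim above would also require.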

Discussion ...
Anonymity: A host executing a stolen subtree cannot determine the client. Managers are assumed to be trustworthy.
Hierarchy: Yes, via manager hierarchy.
Ease of use: Interface incomplete. Clients submit jobs via a special “shell”.

Discussion ...
Adaptive parallelism:
“Owner” (?) of a compute server sets a policy that defines when the server is idle. How?
When a compute server becomes unavailable for Atlas work, all its subcomputations are moved to another compute server.

Adaptive Parallelism ...
Moving a subcomputation requires updating the information linking the subcomputation to its parent and its children.
How long does it take to retreat?
Is the subcomputation restarted? From checkpoint?
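
The link updates needed when a subcomputation moves can be sketched as follows. The `Subcomputation` class and its fields are hypothetical, and hosts are plain strings. The only point illustrated is that both the parent's record of the child and each child's record of its parent must change when the subcomputation migrates.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bookkeeping for a migrating subcomputation; not Atlas code.
final class Subcomputation {
    final String id;
    final Subcomputation parent;                               // null for the root subcomputation
    final List<Subcomputation> children = new ArrayList<>();
    final Map<String, String> childHosts = new HashMap<>();    // child id -> host running it
    String host;                                               // compute server running this subcomputation
    String parentHost;                                         // where results are sent; null for the root

    Subcomputation(String id, String host, Subcomputation parent) {
        this.id = id;
        this.host = host;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
            parent.childHosts.put(id, host);
            this.parentHost = parent.host;
        }
    }

    // Move this subcomputation and repair the links on both sides.
    void migrateTo(String newHost) {
        host = newHost;
        if (parent != null) parent.childHosts.put(id, newHost);       // parent must find its child
        for (Subcomputation child : children) child.parentHost = newHost; // children must find their parent
    }
}
```

In a distributed setting each of these assignments would be a message to another machine, so the cost of retreating grows with the number of children.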

Limitations
Atlas inherits the tree-structured program limitation from Cilk. But this is still a rich set!
Generalizing to non-tree-structured programs seems hard.
No shared variables among threads.
Global file system is read-only.

Conclusion
Jicos design goals = those for Atlas.
Use JXTA to give Jicos a “file system”.
Then, Jicos becomes Atlas’s heir.
