Inter-cluster Job Deployment by AgentTeamwork Sentinel Agents Emory Horvath CSS497 Spring 2006 Advisor: Dr. Munehiro Fukuda
What is Grid Computing? Grid computing seeks to pool large numbers of computers so that unused CPU cycles can be shared for CPU-intensive tasks. Example: Condor. Issues: job coordination; security; software installation and maintenance; fault tolerance.
What is AgentTeamwork? Portable Java-based grid computing platform, based on the mobile agent paradigm. Decentralized architecture, without a central manager. Easy installation and participation. Designed with fault tolerance in mind. Participating computers run a Java process (UWPlace). Each UWPlace can host one or more mobile agent Java processes (UWAgents). Central FTP server hosts the list of available computers.
How AgentTeamwork Works [Figure: system overview. Labeled elements: FTP server; User A and User B; commander, resource, sentinel, and bookkeeper agents; User A's processes and User B's process, each made up of a user program wrapper, snapshot methods, and GridTcp; TCP communication between processes; snapshots; results.]
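How AgentTeamwork Works - 2 [Figure: layered architecture. Layers as listed on the slide, from operating systems up to user applications: operating systems; UWAgents mobile agent execution platform; commander, resource, sentinel, and bookkeeper agents; user program wrapper; GridTcp and Java socket; mpiJava-A and mpiJava-S; mpiJava API; Java user applications.]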
Single-Cluster Hierarchy [Figure: agent tree. Legend: id = agent id; rank = MPI rank. Elements: User; Commander id 0; Resource id 1 (with eXist); Sensor id 5. Sentinels: id 2 (rank 0); ids 8-11 (ranks 1-4); ids 32-34 (ranks 5-7). Bookkeepers: id 3 (rank 0); ids 12-15 (ranks 1-4); ids 48-50 (ranks 5-7). Arrow labels: job submission, XML query, spawn, snapshot.]
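The agent ids on the Single-Cluster Hierarchy slide are consistent with a simple numbering rule: a child's id is four times its parent's id plus a slot number, and MPI ranks follow breadth-first order within the sentinel (or bookkeeper) tree. The slide does not state this rule, so the sketch below is an inference from the diagram. The fan-out of 4 and the function names are assumptions made for illustration, not AgentTeamwork's API.

```python
from collections import deque

# Assumption: a fan-out of 4 is inferred from the ids shown on the slide.
FAN_OUT = 4


def child_ids(parent_id, fan_out=FAN_OUT):
    """Return the ids a parent would give its children under the assumed rule."""
    first = parent_id * fan_out
    # The root has id 0, so slot 0 would collide with its own id and is skipped.
    start = 1 if parent_id == 0 else 0
    return [first + slot for slot in range(start, fan_out)]


def breadth_first_ids(root_id, count, fan_out=FAN_OUT):
    """List the first `count` agent ids of a subtree in breadth-first order.

    Under the assumed rule, the position of an id in this list is its rank.
    """
    order = []
    queue = deque([root_id])
    while queue and len(order) < count:
        agent_id = queue.popleft()
        order.append(agent_id)
        queue.extend(child_ids(agent_id, fan_out))
    return order


if __name__ == "__main__":
    print(child_ids(0))              # [1, 2, 3]
    print(breadth_first_ids(2, 8))   # [2, 8, 9, 10, 11, 32, 33, 34]
    print(breadth_first_ids(3, 8))   # [3, 12, 13, 14, 15, 48, 49, 50]
```

Read against the slide, the second list reproduces the sentinel ids with ranks 0 through 7, and the third reproduces the bookkeeper ids with the same ranks.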
Single-Cluster Job Resumption Steps: (0) Send a new snapshot periodically. (1) Detect a ping error. (2) Search for the latest snapshot. (3) Retrieve the snapshot. (4) Send a new agent. (5) Notify about the restart. [Figure: Sentinel id 2 (rank 0); Sentinels ids 8-11 (ranks 1-4); MPI connections; Bookkeeper id 15 (rank 4); a new Sentinel id 11 (rank 4).]
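The numbered steps on the Single-Cluster Job Resumption slide can be sketched as a small in-memory simulation. Everything named here (`Bookkeeper`, `resume`, and the callback parameters) is hypothetical and is not AgentTeamwork's API. The sketch also assumes that a single monitoring agent drives steps (1) to (5), which the slide does not state.

```python
class Bookkeeper:
    """Hypothetical stand-in for a bookkeeper agent that stores snapshots."""

    def __init__(self):
        self._snapshots = {}  # rank -> list of (sequence number, state)

    def store(self, rank, seq, state):
        # Step (0): a sentinel sends a new snapshot periodically.
        self._snapshots.setdefault(rank, []).append((seq, state))

    def latest(self, rank):
        # Steps (2) and (3): search for and retrieve the newest snapshot.
        history = self._snapshots.get(rank)
        if not history:
            return None
        return max(history, key=lambda entry: entry[0])


def resume(rank, is_alive, bookkeeper, spawn, notify):
    """Run steps (1) to (5) for one rank; return True if a restart happened."""
    if is_alive(rank):                    # (1) the ping succeeded, nothing to do
        return False
    snapshot = bookkeeper.latest(rank)    # (2) + (3) find and fetch the snapshot
    if snapshot is None:
        raise LookupError(f"no snapshot stored for rank {rank}")
    _seq, state = snapshot
    spawn(rank, state)                    # (4) send a new agent with that state
    notify(rank)                          # (5) tell the other agents about it
    return True


if __name__ == "__main__":
    keeper = Bookkeeper()
    keeper.store(4, 1, "state-1")
    keeper.store(4, 2, "state-2")
    restarted = resume(
        rank=4,
        is_alive=lambda r: False,         # pretend the ping to rank 4 failed
        bookkeeper=keeper,
        spawn=lambda r, s: print("spawn rank", r, "from", s),
        notify=lambda r: print("notify peers about rank", r),
    )
    print(restarted)
```

Passing the ping, spawn, and notify actions as callbacks keeps the sketch independent of any real transport, which is where the actual system would differ most.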
Extending to Multiple Clusters The existing AgentTeamwork system allows job deployment only within a single intranet cluster. The primary focus of my project was to extend AgentTeamwork to allow job deployment and resumption across multiple clusters: rewrite and extend existing AgentTeamwork algorithms to support multiple clusters; rewrite the job deployment code to deploy gateway tasks and remote-cluster jobs; integrate new gateway-enabled Java socket functionality; rewrite the job-resumption code to resume failed remote clusters and remote compute nodes.
Multiple-Cluster Hierarchy [Figure: agent tree spanning several clusters. Top level: User; Commander id 0; Resource id 1; Sentinel id 2; Bookkeeper id 3 (rank 0). Cluster 0: cluster gateway Sentinel id 8 (rank -8); Sentinel id 32 (rank 0); Sentinels ids 128-131 (ranks 1-4); Sentinel id 512 (rank 5). Cluster 1: cluster gateway Sentinel id 33 (rank -33); Sentinel id 132 (rank 6); Sentinels ids 528-531 (ranks 7-10). Clusters 2 and 3: cluster gateway Sentinels id 34 (rank -34) and id 35 (rank -35). Desktop computers: Sentinel id 9 (rank X); Sentinels ids 36-39 (ranks X+1 to X+4).]
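On the Multiple-Cluster Hierarchy slide, each cluster gateway sentinel appears to carry a negative rank equal to minus its agent id (id 8 has rank -8, id 33 has rank -33), while the remaining sentinels carry non-negative MPI ranks that continue from one cluster to the next. The following sketch reproduces that labeling for Clusters 0 and 1. It is a reading of the diagram rather than a documented rule, and the function names and the input layout are assumptions made for illustration.

```python
def gateway_rank(agent_id):
    """Rank shown for a cluster gateway on the slide: minus its agent id."""
    return -agent_id


def is_gateway(rank):
    """Under the assumed convention, only gateways have negative ranks."""
    return rank < 0


def assign_ranks(clusters):
    """Map agent ids to ranks.

    `clusters` is a list of (gateway id, compute sentinel ids in order).
    Compute sentinels get consecutive ranks across clusters; gateways get
    the negative of their own id and do not consume an MPI rank.
    """
    ranks = {}
    next_rank = 0
    for gateway_id, compute_ids in clusters:
        ranks[gateway_id] = gateway_rank(gateway_id)
        for agent_id in compute_ids:
            ranks[agent_id] = next_rank
            next_rank += 1
    return ranks


if __name__ == "__main__":
    # Ids copied from the slide for Cluster 0 and Cluster 1.
    slide_clusters = [
        (8, [32, 128, 129, 130, 131, 512]),
        (33, [132, 528, 529, 530, 531]),
    ]
    for agent_id, rank in assign_ranks(slide_clusters).items():
        role = "gateway" if is_gateway(rank) else "compute"
        print(agent_id, rank, role)
```

Keeping gateway ranks negative lets a single integer field distinguish relay-only agents from agents that run a user process, which may be why the slide labels them this way.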
Multiple-Cluster Job Resumption [Figure: agent tree with resumption targets. Top level: User; Commander id 0; Resource id 1; Sentinel id 2; Bookkeeper id 3 (rank 0). Cluster 0: cluster gateway 0, Sentinel id 8 (rank -8); Sentinel id 32 (rank 0); Sentinels ids 128-131 (ranks 1-4); Sentinel id 512 (rank 5). Cluster 1: cluster gateway 1, Sentinel id 33 (rank -33); Sentinel id 132 (rank 6); Sentinels ids 528-531 (ranks 7-10). Desktop computers. Other labels: compute node, cluster gateway, extra node, extra cluster, extra cluster gateway, new sentinel.]
Other Current & Ongoing Tasks AgentTeamwork is an ongoing project, with parallel contributions by many other team members: RMI-to-Java-socket enhancements, developed by Duncan Smith, were integrated. Agent file I/O enhancements (Jumpei Miyauchi) and sensor agent enhancements (Jun Morisaki) were also integrated. Although I am presenting now, I will continue working on the project over the summer: completing inter-cluster fault tolerance and job redeployment; completing inter-cluster performance tests; and assisting Cuong Ngo as needed with the implementation of dynamic resource allocation.
Acknowledgements Professor Fukuda, my advisor. NSF Middleware Initiative. The UW-Bothell CSS Program. Graphics and other slide content contributed by Prof. Fukuda from earlier AgentTeamwork presentations and papers.
Questions?