Agent Teamwork Research Assistant
Evaluation of Agent Teamwork, a High-Performance Distributed Computing Middleware
Solomon Lane, Agent Teamwork Research Assistant, October 2006 – March 2007

For my internship over the last two quarters I worked as an undergraduate research assistant on AgentTeamwork, a joint project between UW Bothell and Ehime University in Japan. The project is guided here by Professor Munehiro Fukuda, who was my advisor. I'm going to start with a little background, then talk about the work I did and what I learned.
What is Agent Teamwork?
HPDC Middleware; Job Dispatch & Termination; Programming Framework; Under Ongoing Development

HPDC is the use of distributed computing and parallel processing to improve application performance. Agent Teamwork provides two main functions. Job Dispatch & Termination covers dynamic node selection, file transfer for executable and data files, and starting, stopping, and returning results from jobs. The Programming Framework provides programmatic coordination of the distributed resources and supports distributed application development. Development is performance focused, and my job was to evaluate that performance.
Project Objectives
Evaluate Agent Teamwork's performance against a contemporary alternative: Job Dispatch & Termination performance and Framework performance. Build a reference platform. Write 3 benchmark programs that exercise the framework.

My main objective was to evaluate Agent Teamwork's Job Dispatch & Termination and Framework performance against current mainstream solutions. To do this I had to build a reference platform to evaluate it against and write three benchmark programs that would exercise the framework.
Job Dispatch & Termination Performance Evaluation: Globus-Based Reference Platform
Globus Toolkit; OpenPBS scheduler; MPICH-G2

There is no one-to-one match with Agent Teamwork; instead, a number of products provide different sets of services. The Globus Toolkit (GTK) is the de facto standard for grid computing, but it is just a toolkit, so a complete reference platform required integrating it with OpenPBS and MPICH-G2.
Reference Platform Hardware
66 computers divided into 2 clusters. Agent Teamwork also runs on these same 66 computers. The hardware differs between the two clusters, but because both platforms run on the same machines the comparisons are apples to apples.
Reference Platform Overview

Getting these components built, installed, and configured to the point where I could run a distributed job meant overcoming a lot of challenges. This diagram provides a detailed overview of how the reference platform worked. To run a job, you generate a job definition file using RSL (the Resource Specification Language) and submit it along with your user certificate; a minimal sketch of such a submission follows below. The globusrun program parses the RSL and, in the case of a multi-cluster job, uses the DUROC library to coordinate a GRAM client for each cluster. The GRAM client submits the job to a gatekeeper on the cluster head, which uses the GSI (Grid Security Infrastructure) to authenticate and authorize the job submission. The gatekeeper then starts a job manager, which issues a callback to the GRAM client to connect standard error and standard output back to the client. The job manager then submits the job details to the PBS server, which applies any policies to determine which queue to place the job in. The PBS scheduler locates suitable nodes for the job and transfers the executable and any data files to the selected nodes, and the PBS MOM launches the application. Applications written in the MPICH-G2 framework use DUROC and the Grid Security Infrastructure to coordinate their cooperative parallel execution.
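To make the submission path concrete, here is a minimal sketch of what a two-cluster RSL multi-request and its submission might look like. The hostnames, paths, and node counts are hypothetical placeholders, not the project's actual configuration.

    + ( &(resourceManagerContact="cluster1-head.example.edu")
        (count=8)
        (jobtype=mpi)
        (executable=/home/user/bin/testprog)
        (directory=/home/user) )
      ( &(resourceManagerContact="cluster2-head.example.edu")
        (count=8)
        (jobtype=mpi)
        (executable=/home/user/bin/testprog)
        (directory=/home/user) )

    # Create a proxy credential from the user certificate, then submit the job.
    grid-proxy-init
    globusrun -f job.rsl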
Reference Platform Challenges
Administrator access to machines; host configuration and cryptic error messages; DNS vs. hosts files; inconsistent hosts files; inconsistent PTR records; inconsistent port ACLs (surfacing as cryptic errors like "globus_init: failed"); GTK authentication.

There was wide variance in system configuration parameters across the machines, and the platform components had dependencies on these configurations. Discovering the root cause of a failure often required strace, tcpdump, or attaching gdb.
Debugging
strace, tcpdump, and gdb were the primary debugging tools; example invocations are sketched below.
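These are illustrative examples of the kinds of commands involved, not a transcript of the actual debugging sessions; the hostname, interface, and PID are placeholders, and 2119 is the default Globus gatekeeper port.

    # Trace system-call activity of a failing submission to a log file.
    strace -f -o globusrun.trace globusrun -f job.rsl

    # Watch gatekeeper traffic to see whether a connection is refused or times out.
    tcpdump -i eth0 host cluster1-head.example.edu and port 2119

    # Attach to a hung job manager process to inspect where it is blocked.
    gdb -p <pid>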
Job Dispatch & Termination Function Evaluation
Methodology: not evaluating job execution performance. Ported an available test program to the MPICH-G2 framework. Measure how long it takes a job submission to be deployed, executed, and cleaned up. Run with 2-64 nodes across the two clusters, in a depth-first node distribution series and a breadth-first node distribution series; one way to time a single data point is sketched below.

Since I was not evaluating job execution itself, I needed a lightweight test program, so I ported my professor's test program to MPICH-G2.
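As a hedged illustration of how one dispatch-and-termination data point could be timed from the command line (the RSL file name is a placeholder, and this is not the project's actual measurement harness):

    # Time the full submit -> deploy -> execute -> cleanup cycle for one run;
    # repeat for each node count from 2 to 64 and for both distribution series.
    /usr/bin/time -p globusrun -f job.rsl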
Results
The 10-second stair-step challenge: the reference platform's performance numbers would cluster, with occasional outliers. This was most pronounced in the breadth-first runs, making me suspect a higher sensitivity to something network related. AgentTeamwork is competitive.
Results
Results
Framework Function Evaluation
Framework issues: Agent Teamwork MPI implementation; MPICH-G2 (C++); mpiJava. MPI framework communication functions: initialization, Barrier, Broadcast, Gather, Scatter, etc. Goal: write 3 benchmark programs with communication-intensive algorithms.

The second part of my effort was to evaluate the framework performance. The Agent Teamwork framework provides an MPI implementation; however, Agent Teamwork is written entirely in Java, whereas MPICH-G2 is a C++ framework. To avoid comparing frameworks in different languages, I evaluated Agent Teamwork against mpiJava, a popular Java MPI implementation. The sketch below shows the kind of communication calls the benchmarks exercise.
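As an illustration of the communication functions under test (not project code), here is a minimal mpiJava-style program, assuming the standard mpiJava API (MPI.Init, Barrier, Bcast, Finalize):

    // Minimal mpiJava-style sketch exercising init, barrier, and broadcast.
    import mpi.*;

    public class CommSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);                                // initialization
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();

            int[] data = new int[1];
            if (rank == 0) data[0] = 42;                   // root fills the buffer

            MPI.COMM_WORLD.Barrier();                      // synchronize all ranks
            MPI.COMM_WORLD.Bcast(data, 0, 1, MPI.INT, 0);  // broadcast from rank 0

            System.out.println("rank " + rank + " of " + size + " got " + data[0]);
            MPI.Finalize();
        }
    }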
Benchmark Programs
MD - a molecular dynamics simulation. Wave2D - a two-dimensional wave propagation simulation. Mandelbrot - a Mandelbrot set generator.

I coded each program twice, once per framework, except for Mandelbrot, which one of the professor's students had already coded in mpiJava. A sketch of one common way to distribute the Mandelbrot work is shown below.
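For illustration only, here is a minimal mpiJava-style sketch of one common way to parallelize a Mandelbrot generator, striping image rows across ranks; the image size, bounds, and decomposition are assumptions, not the project's actual implementation.

    // Illustrative mpiJava-style Mandelbrot sketch: rows are striped across
    // ranks and combined at rank 0. Sizes and bounds are arbitrary assumptions.
    import mpi.*;

    public class MandelSketch {
        static final int W = 256, H = 256, MAX_ITER = 255;

        static int escape(double cr, double ci) {
            double zr = 0, zi = 0;
            int i = 0;
            while (zr * zr + zi * zi < 4.0 && i < MAX_ITER) {
                double t = zr * zr - zi * zi + cr;
                zi = 2 * zr * zi + ci;
                zr = t;
                i++;
            }
            return i;
        }

        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();

            int[] image = new int[W * H];
            // Each rank computes every size-th row (cyclic striping).
            for (int y = rank; y < H; y += size) {
                for (int x = 0; x < W; x++) {
                    double cr = -2.0 + 3.0 * x / W;
                    double ci = -1.5 + 3.0 * y / H;
                    image[y * W + x] = escape(cr, ci);
                }
            }

            // Combine the partial images at rank 0 (cells not computed by a
            // rank are zero, so an element-wise MAX reconstructs the image).
            int[] result = new int[W * H];
            MPI.COMM_WORLD.Reduce(image, 0, result, 0, W * H, MPI.INT, MPI.MAX, 0);

            if (rank == 0) System.out.println("computed " + W + "x" + H + " image");
            MPI.Finalize();
        }
    }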
Agent Teamwork Programming Snapshots
Programming model: func_n

int func_0(String[] args) { … return 1; }
int func_1() { … }

Code maturity: Agent Teamwork takes regular runtime snapshots of a program and is capable of migrating a running job from one node to another for load balancing and dynamic failure recovery. Because Java won't serialize the program counter and stack, user code is broken into func_n methods so that execution state can be captured at function boundaries. Race conditions resulting in deadlocks, along with other framework bugs, prevented completing the framework evaluation. A sketch of the func_n structure follows.
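This is a minimal, hypothetical sketch of what a program in the func_n style might look like, assuming the return value of each function names the next func_n to run; the class name, driver loop, and "stop" value are invented for illustration and are not the actual Agent Teamwork API.

    // Hypothetical illustration of the func_n programming style: each phase is
    // a method, and its int return value selects the next func_n, so the
    // framework can snapshot between calls.
    public class FuncNSketch {
        private int[] data;   // state lives in fields, which Java can serialize

        public int func_0(String[] args) {
            data = new int[Integer.parseInt(args[0])];   // initialization phase
            return 1;                                    // next: func_1
        }

        public int func_1() {
            for (int i = 0; i < data.length; i++)        // computation phase
                data[i] = i * i;
            return 2;                                    // next: func_2
        }

        public int func_2() {
            System.out.println("done: " + data.length);  // wrap-up phase
            return -1;                                   // hypothetical "stop" value
        }

        // Stand-in for the framework's driver loop, included only to make the
        // sketch self-contained and runnable.
        public static void main(String[] args) {
            FuncNSketch prog = new FuncNSketch();
            int next = prog.func_0(args.length > 0 ? args : new String[] {"8"});
            while (next >= 0) {
                next = (next == 1) ? prog.func_1() : prog.func_2();
            }
        }
    }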
Partial Results
Partial Results
Agent Teamwork was two orders of magnitude slower; I suspect this is related to the snapshots and the size of the data set.
Future Work
Framework debugging. Develop a pre-processor to convert conventionally programmed code into the snapshot-able func_n model.
Skills Developed During Project
Significant experience with Globus, OpenPBS, and MPI. Extensive debugging with tcpdump, strace, and gdb. Experience with performance analysis and writing MPI programs. New insights into and understanding of HPDC.