Agent Teamwork Research Assistant


1 Agent Teamwork Research Assistant
Evaluation of Agent Teamwork: A High Performance Distributed Computing Middleware
Solomon Lane, Agent Teamwork Research Assistant, October 2006 – March 2007
For my internship over the last two quarters, I worked as an undergraduate research assistant on Agent Teamwork, a joint project between UWB and Ehime University in Japan. The project is guided here by Professor Munehiro Fukuda, who was my advisor on the project. I'm going to start with a little background, then talk about the work I did and what I learned.

2 What is Agent Teamwork?
HPDC Middleware
Job Dispatch & Termination
Programming Framework
Under Ongoing Development
HPDC (high performance distributed computing) is the use of distributed computing and parallel processing to improve application performance. Agent Teamwork provides two main functions. Job Dispatch & Termination covers dynamic node selection, file transfer for executable and data files, stopping and starting jobs, and returning results. The Programming Framework provides programmatic coordination of the distributed resources and supports distributed application development. Development is performance focused, and my job was to evaluate that performance.

3 Project Objectives
Evaluate Agent Teamwork's performance against a contemporary alternative: Job Dispatch & Termination performance and Framework performance
Build a reference platform
Write 3 benchmark programs that exercise the framework
My main objective was to evaluate Agent Teamwork's Job Dispatch & Termination and Framework performance against current mainstream solutions. To do this, I had to build a reference platform to evaluate it against and write 3 benchmark programs that would exercise the framework.

4 Job Dispatch & Termination Performance Evaluation
Globus-based reference platform: Globus Toolkit, OpenPBS scheduler, MPICH-G2
There is no one-to-one match with Agent Teamwork; instead, a number of products provide different sets of services. The Globus Toolkit (GTK) is the de facto standard for grid computing, but it is just a toolkit: a complete reference platform required integration with OpenPBS and MPICH-G2.

5 Reference Platform Hardware
66 computers divided into 2 clusters
Agent Teamwork also runs on these same 66 computers, so although the hardware differs between the clusters, the comparisons are apples to apples.

6 Reference Platform Overview
Getting these components built, installed, and configured to the point where I could run a distributed job meant overcoming a lot of challenges. This diagram provides a detailed overview of how the reference platform worked.
To run a job, you generate a job definition file using the RSL and submit it along with your user certificate.
The globusrun program parses the RSL and, in the case of a multi-cluster job, uses the DUROC library to coordinate a GRAM client for each cluster.
The GRAM client submits the job to a gatekeeper on the cluster head, which uses the GSI to authenticate and authorize the job submission. The gatekeeper then starts a job manager, which issues a callback to the GRAM client to connect standard error and standard output back to the client.
The job manager then submits the job details to the PBS server, which applies any policies to determine which queue to place the job in.
The PBS scheduler locates suitable nodes for the job and transfers the executable and any data files to the selected nodes.
PBS MOM launches the application.
Applications written in the MPICH-G2 framework make use of DUROC and the Grid Security Infrastructure to coordinate their cooperative parallel execution.
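To make the job definition step concrete, here is a minimal Java sketch that assembles an RSL multi-request with one subjob per cluster. It assumes the classic RSL attribute names `resourceManagerContact`, `count`, and `executable`; the class name `RslSketch`, the host names, and the executable path are hypothetical, and a real MPICH-G2 request carries further attributes that are omitted here.

```java
final class RslSketch {
    /**
     * Builds a multi-request: a leading '+' followed by one '&' conjunction
     * per subjob. contacts[i] is the (hypothetical) gatekeeper contact of
     * cluster i and counts[i] is the number of processes requested there.
     */
    static String multiRequest(String executable, String[] contacts, int[] counts) {
        if (contacts.length != counts.length) {
            throw new IllegalArgumentException("one count per contact is required");
        }
        StringBuilder rsl = new StringBuilder("+");
        for (int i = 0; i < contacts.length; i++) {
            rsl.append("\n( &(resourceManagerContact=\"").append(contacts[i]).append("\")")
               .append("\n   (count=").append(counts[i]).append(")")
               .append("\n   (executable=\"").append(executable).append("\") )");
        }
        return rsl.toString();
    }

    public static void main(String[] args) {
        // Hypothetical cluster heads and executable path, for illustration only.
        String[] contacts = {
            "cluster-a.example.edu/jobmanager-pbs",
            "cluster-b.example.edu/jobmanager-pbs"
        };
        System.out.println(multiRequest("/home/user/bench", contacts, new int[] {4, 4}));
    }
}
```

The printed text is what would be saved as the job definition file and handed to the submission program together with the user certificate.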

7 Reference Platform Challenges
Administrator access to machines
Host configuration and cryptic error messages: DNS vs. hosts files, inconsistent hosts files, inconsistent PTR records, inconsistent port ACLs
: globus_init: failed
GTK authentication
There was wide variance in system configuration parameters, and the platform components had dependencies on these configurations. Discovery required strace and/or tcpdump, or attaching gdb.

8 Debugging
strace, tcpdump, GDB

9 Job Dispatching and Termination Function Evaluation
Not evaluating the job execution performance
Methodology: ported an available test program to the MPICH-G2 framework; measured how long it takes a job submission to be deployed, executed, and cleaned up; ran with 2-64 nodes across the two clusters in a depth-first node distribution series and a breadth-first node distribution series
Because I was not evaluating job execution, I needed a lightweight test program, so I ported the professor's test program to MPICH-G2.
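The two node distribution series can be sketched as follows. The slides do not define the terms, so this is an assumption: depth-first is read as filling one cluster before touching the next, and breadth-first as dealing nodes out round-robin across clusters. The class name `NodeDistributionSketch` and the cluster capacities are hypothetical.

```java
import java.util.Arrays;

final class NodeDistributionSketch {
    /** Assumed depth-first: fill cluster 0 completely, then cluster 1, and so on. */
    static int[] depthFirst(int nodes, int[] capacity) {
        int[] alloc = new int[capacity.length];
        int remaining = nodes;
        for (int c = 0; c < capacity.length && remaining > 0; c++) {
            alloc[c] = Math.min(remaining, capacity[c]);
            remaining -= alloc[c];
        }
        if (remaining > 0) {
            throw new IllegalArgumentException("not enough nodes in the clusters");
        }
        return alloc;
    }

    /** Assumed breadth-first: hand out one node per cluster in turn. */
    static int[] breadthFirst(int nodes, int[] capacity) {
        int[] alloc = new int[capacity.length];
        int remaining = nodes;
        while (remaining > 0) {
            boolean placed = false;
            for (int c = 0; c < capacity.length && remaining > 0; c++) {
                if (alloc[c] < capacity[c]) {
                    alloc[c]++;
                    remaining--;
                    placed = true;
                }
            }
            if (!placed) {
                throw new IllegalArgumentException("not enough nodes in the clusters");
            }
        }
        return alloc;
    }

    public static void main(String[] args) {
        int[] capacity = {32, 32}; // hypothetical cluster sizes
        for (int n = 2; n <= 64; n *= 2) {
            System.out.println(n + " nodes: depth-first " + Arrays.toString(depthFirst(n, capacity))
                    + ", breadth-first " + Arrays.toString(breadthFirst(n, capacity)));
        }
    }
}
```

Under this reading, a breadth-first run spans both clusters even at small node counts, while a depth-first run stays inside one cluster until that cluster is full.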

10 Results
10 second stair step challenge
The reference platform's performance numbers clustered, with occasional outliers. This was most pronounced in the breadth-first runs, which made me suspect a higher sensitivity to something network related. Agent Teamwork is competitive.

11 Results

12 Results

13 Framework Function Evaluation
Framework issues: Agent Teamwork MPI implementation; MPICH-G2 (C++); mpiJava
MPI framework communication functions: Initialization, Barrier, Broadcast, Gather, Scatter, etc.
Goal: write 3 benchmark programs that have communication-intensive algorithms
The second part of my effort was to evaluate the framework performance, which raised a C++ vs. Java issue. The Agent Teamwork framework provides an MPI implementation. However, Agent Teamwork is written entirely in Java, whereas MPICH-G2 is a C++ framework. To avoid comparing frameworks in different languages, I evaluated Agent Teamwork against the mpiJava framework, which is a popular Java MPI implementation.
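For readers unfamiliar with the communication functions named above, the following plain-Java sketch simulates what Broadcast, Scatter, and Gather do to data, using one array per rank inside a single process. It is not the mpiJava or Agent Teamwork API; the class and method names are hypothetical and no real message passing takes place.

```java
import java.util.Arrays;

final class CollectivesSketch {
    /** Broadcast: every rank ends up with its own copy of the root's buffer. */
    static int[][] broadcast(int[] rootData, int ranks) {
        int[][] perRank = new int[ranks][];
        for (int r = 0; r < ranks; r++) {
            perRank[r] = rootData.clone();
        }
        return perRank;
    }

    /** Scatter: the root's buffer is cut into equal chunks, one chunk per rank. */
    static int[][] scatter(int[] rootData, int ranks) {
        if (rootData.length % ranks != 0) {
            throw new IllegalArgumentException("buffer must divide evenly among ranks");
        }
        int chunk = rootData.length / ranks;
        int[][] perRank = new int[ranks][];
        for (int r = 0; r < ranks; r++) {
            perRank[r] = Arrays.copyOfRange(rootData, r * chunk, (r + 1) * chunk);
        }
        return perRank;
    }

    /** Gather: the root concatenates every rank's chunk in rank order. */
    static int[] gather(int[][] perRank) {
        int total = 0;
        for (int[] part : perRank) {
            total += part.length;
        }
        int[] gathered = new int[total];
        int pos = 0;
        for (int[] part : perRank) {
            System.arraycopy(part, 0, gathered, pos, part.length);
            pos += part.length;
        }
        return gathered;
    }

    public static void main(String[] args) {
        int[][] parts = scatter(new int[] {1, 2, 3, 4, 5, 6, 7, 8}, 4);
        for (int[] part : parts) {          // each "rank" works on its own chunk
            for (int i = 0; i < part.length; i++) {
                part[i] *= part[i];
            }
        }
        System.out.println(Arrays.toString(gather(parts)));
    }
}
```

A communication-intensive benchmark repeats this scatter, compute, gather cycle many times, which is why the cost of these calls dominates the framework comparison.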

14 Benchmark Programs
MD - a molecular dynamics simulation
Wave2D - a wave dissemination simulation
Mandelbrot - a Mandelbrot generator
Code each program twice, except for Mandelbrot, which one of the professor's students had already coded in mpiJava
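As an illustration of the kind of work a Mandelbrot benchmark distributes, here is a minimal Java sketch of the escape-time iteration together with a block partition of image rows among ranks. The row-wise partition is an assumption made for illustration and is not taken from the student's mpiJava program; the class and method names are hypothetical.

```java
final class MandelbrotSketch {
    /** Number of iterations of z = z*z + c before |z| exceeds 2, capped at maxIter. */
    static int escapeTime(double cr, double ci, int maxIter) {
        double zr = 0.0;
        double zi = 0.0;
        int n = 0;
        while (n < maxIter && zr * zr + zi * zi <= 4.0) {
            double nextZr = zr * zr - zi * zi + cr;
            zi = 2.0 * zr * zi + ci;
            zr = nextZr;
            n++;
        }
        return n;
    }

    /** Rows [start, end) owned by a rank when height rows are split as evenly as possible. */
    static int[] rowRange(int rank, int ranks, int height) {
        int base = height / ranks;
        int extra = height % ranks;   // the first 'extra' ranks get one additional row
        int start = rank * base + Math.min(rank, extra);
        int end = start + base + (rank < extra ? 1 : 0);
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int width = 60, height = 20, ranks = 4, maxIter = 50;
        for (int rank = 0; rank < ranks; rank++) {
            int[] rows = rowRange(rank, ranks, height);
            for (int y = rows[0]; y < rows[1]; y++) {
                StringBuilder line = new StringBuilder();
                for (int x = 0; x < width; x++) {
                    double cr = -2.0 + 3.0 * x / width;
                    double ci = -1.0 + 2.0 * y / height;
                    line.append(escapeTime(cr, ci, maxIter) == maxIter ? '#' : '.');
                }
                System.out.println(line);
            }
        }
    }
}
```

In a real MPI version each rank would compute only its own row range and the results would be gathered at the root.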

15 Agent Teamwork Programming
Snapshots
Programming model: func_n
int func_0 (String[] Args){ return 1; } int func_1 () {
Code maturity
Agent Teamwork takes regular runtime snapshots of a program and is capable of migrating a running job from one node to another for load balancing and dynamic failure recovery. Java won't serialize the program counter and stack, so programs are written in the snapshot-able func_n model. Race conditions resulting in deadlocks, along with other framework bugs, prevented me from completing the framework evaluation.
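The idea behind the func_n model can be sketched in plain Java. This is not Agent Teamwork's actual API: it assumes, as the `return 1` in `func_0` suggests, that each `func_n` returns the index of the next function to run, so that index can stand in for the program counter that Java cannot serialize. The class name, the `-1` finished marker, and the `snapshot`/`restore` helpers are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class FuncModelSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    int next = 0;   // index of the next func_n; replaces the program counter
    long sum = 0;   // application state lives in fields, not on the stack
    int i = 0;

    int func_0(String[] args) { sum = 0; i = 1; return 1; }
    int func_1() { sum += i; i++; return (i <= 10) ? 1 : 2; }   // one loop iteration per call
    int func_2() { return -1; }                                  // -1: hypothetical "finished" marker

    /** Runs exactly one func_n; a snapshot is safe between any two steps. */
    void step() {
        switch (next) {
            case 0: next = func_0(new String[0]); break;
            case 1: next = func_1(); break;
            case 2: next = func_2(); break;
            default: throw new IllegalStateException("no func_" + next);
        }
    }

    byte[] snapshot() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static FuncModelSketch restore(byte[] snapshot) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            return (FuncModelSketch) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Runs a few steps, snapshots, then finishes the job from the restored copy. */
    static long runWithMigration(int stepsBeforeMove) {
        FuncModelSketch job = new FuncModelSketch();
        for (int s = 0; s < stepsBeforeMove && job.next >= 0; s++) {
            job.step();
        }
        FuncModelSketch resumed = restore(job.snapshot());   // stands in for moving to another node
        while (resumed.next >= 0) {
            resumed.step();
        }
        return resumed.sum;
    }

    public static void main(String[] args) {
        System.out.println("sum after migration: " + runWithMigration(4));
    }
}
```

The cost of the model is visible in `func_1`: an ordinary loop has to be unrolled into repeated calls, with every local variable promoted to a field, which is tedious to write by hand.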

16 Partial Results

17 Partial Results
Two orders of magnitude slower; I suspect this is related to snapshots and the size of the data set

18 Future Work
Framework debugging
Develop a pre-processor to convert conventionally programmed code into the snapshot-able func_n model

19 Skills Developed During Project
Significant experience with Globus, OpenPBS, and MPI
Extensive debugging with tcpdump, strace, and gdb
Experience with performance analysis and writing MPI programs
New insights into and understanding of HPDC

